The tidyverse is loaded in the code chunk below. The visualization package, ggplot2, and the data manipulation package, dplyr, are part of the “larger” tidyverse.
The modelr package is loaded in the code chunk below. From modelr we can use functions to calculate performance metrics for our models.
library(modelr)
Let’s now load in the training data, but assign it to a different variable
name. We can read in an .rds file with the
readr::read_rds() function.
train_dataset <- readr::read_rds("my_train_data.rds")
train_dataset |> glimpse()
## Rows: 835
## Columns: 15
## $ R <dbl> 172, 26, 172, 28, 170, 175, 90, 194, 171, 122, 0, 88, 14…
## $ G <dbl> 58, 88, 94, 87, 66, 89, 78, 106, 68, 151, 121, 140, 82, …
## $ B <dbl> 62, 151, 58, 152, 58, 65, 136, 53, 107, 59, 88, 58, 132,…
## $ Lightness <chr> "dark", "dark", "dark", "dark", "dark", "dark", "dark", …
## $ Saturation <chr> "bright", "bright", "bright", "bright", "bright", "brigh…
## $ Hue <dbl> 4, 31, 8, 32, 5, 6, 34, 10, 1, 21, 24, 22, 36, 16, 26, 1…
## $ response <dbl> 12, 10, 16, 10, 11, 16, 10, 19, 14, 25, 14, 19, 14, 38, …
## $ outcome <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ LightnessNum <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ SaturationNum <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,…
## $ y <dbl> -1.9924302, -2.1972246, -1.6582281, -2.1972246, -2.09074…
## $ logit_R <dbl> 0.72865387, -2.17562547, 0.72865387, -2.09274551, 0.6931…
## $ logit_G <dbl> -Inf, -1.62924054, -1.40691365, -1.66965677, -3.08534443…
## $ logit_B <dbl> -2.4624338, 0.1784828, -2.7950616, 0.1996131, -2.7950616…
## $ logit_Hue <dbl> -2.36712361, 1.79175947, -1.38629436, 2.04769284, -2.047…
df <- readr::read_rds("df.rds")
# Model 1: Intercept-only model
mod1 <- lm(y ~ 1, data = train_dataset)
# Model 2: Categorical variables only – linear additive
mod2 <- lm(y ~ Lightness + Saturation, data = train_dataset)
# Model 3: Continuous variables only - linear additive
mod3 <- lm(y ~ R + G + B + Hue, data = train_dataset)
# Model 4: All categorical and continuous variables - linear additive
mod4 <- lm(y ~ Lightness + Saturation + R + G + B + Hue, data = train_dataset)
# Model 5: Interaction of categorical inputs with all continuous inputs main effects
mod5 <- lm(y ~ Lightness * R + Lightness * G + Lightness * B + Lightness * Hue +
Saturation * R + Saturation * G + Saturation * B + Saturation * Hue, data = train_dataset)
# Model 6: Add categorical inputs to all main effect and all pairwise interactions of continuous inputs
mod6 <- lm(y ~ (R + G + B + Hue)^2 + Lightness + Saturation, data = train_dataset)
# Model 7: Interaction of the categorical inputs with all main effect and all pairwise interactions of continuous inputs
mod7 <- lm(y ~ (Lightness + Saturation) * (R + G + B + Hue)^2, data = train_dataset)
library(splines)
# Model 8: Model with basis functions of your choice
mod8 <- lm(y ~ ns(Hue, df = 3), data = train_dataset)
# Model 9: Model with non-linear basis functions based on EDA
# Using natural cubic splines for continuous variables
mod9 <- lm(y ~ ns(R, df = 3) + ns(G, df = 3) + ns(B, df = 3) + ns(Hue, df = 3), data = train_dataset)
# Model 10: Model considering interactions of basis functions with other basis functions and categorical inputs
# Interaction of spline basis functions with LightnessNum
mod10 <- lm(y ~ (ns(R, df = 3) + ns(G, df = 3) + ns(B, df = 3)) * Lightness, data = train_dataset)
Now that we have fit multiple models of varying complexity, it is time to identify the best performing model. I will identify the best model using training-set-only performance metrics, considering R-squared, AIC, and BIC.
R-squared, or the coefficient of determination, is a statistical measure of how well the data fit a regression model. R-squared ranges from 0 to 1, with 0 indicating that the model does not explain any variability, and 1 indicating that it explains all the variability.
The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) are two criteria used to select regression models. Both are penalized-likelihood information criteria: they reward goodness of fit through the log-likelihood while penalizing the number of estimated parameters, and so estimate how well a model will predict future values.
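As a quick illustration of how these criteria are assembled from the log-likelihood and the parameter count, here is a small check on the built-in mtcars data (illustrative only, not the paint data):

```r
# AIC = 2k - 2*logLik and BIC = k*log(n) - 2*logLik, where k counts all
# estimated parameters (coefficients plus sigma) and n is the sample size.
fit <- lm(mpg ~ wt, data = mtcars)
ll <- as.numeric(logLik(fit))     # maximized log-likelihood
k  <- attr(logLik(fit), "df")     # number of estimated parameters
n  <- nobs(fit)

aic_manual <- 2 * k - 2 * ll
bic_manual <- k * log(n) - 2 * ll
```

Both quantities match `AIC()` and `BIC()` from base R on the same model. Note that BIC's log(n) penalty grows with the sample size, which is why BIC tends to prefer simpler models than AIC.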
I’ve created a function called extract_metrics() that
encapsulates the broom::glance() function. This allows me
to incorporate the model name along with the performance metrics.
extract_metrics <- function(mod, mod_name)
{
broom::glance(mod) %>% mutate(mod_name = mod_name)
}
The extract_metrics() function is systematically applied
to each lm() model object, and the outcomes are aggregated
into a dataframe using purrr::map_dfr().
all_metrics <- purrr::map2_dfr(list(mod1, mod2, mod3, mod4, mod5, mod6, mod7, mod8, mod9, mod10),
as.character(1:10),
extract_metrics)
A glimpse of the all_metrics object is shown below. Each row is a model and the columns correspond to various performance metrics associated with the training set performance.
all_metrics %>% glimpse()
## Rows: 10
## Columns: 13
## $ r.squared <dbl> 0.00000000, 0.88463262, 0.98810380, 0.99450819, 0.997808…
## $ adj.r.squared <dbl> 0.00000000, 0.88294843, 0.98804647, 0.99440077, 0.997626…
## $ sigma <dbl> 1.18434210, 0.40519660, 0.12948675, 0.08862198, 0.057703…
## $ statistic <dbl> NA, 525.25537, 17235.04027, 9258.18445, 5477.47518, 8494…
## $ p.value <dbl> NA, 0.000000000, 0.000000000, 0.000000000, 0.000000000, …
## $ df <dbl> NA, 12, 4, 16, 64, 22, 142, 3, 12, 69
## $ logLik <dbl> -1325.5849, -423.9378, 524.5814, 847.2924, 1230.8036, 94…
## $ AIC <dbl> 2655.1698, 875.8757, -1037.1628, -1658.5849, -2329.6072,…
## $ BIC <dbl> 2664.6246, 942.0597, -1008.7982, -1573.4911, -2017.5967,…
## $ deviance <dbl> 1169.823619, 134.959484, 13.916459, 6.424454, 2.563881, …
## $ df.residual <int> 834, 822, 830, 818, 770, 812, 692, 831, 822, 765
## $ nobs <int> 835, 835, 835, 835, 835, 835, 835, 835, 835, 835
## $ mod_name <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"
# top 3 models according to r squared.
top_3_models <- all_metrics %>%
arrange(desc(r.squared)) %>%
head(3)
print(top_3_models)
## # A tibble: 3 × 13
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.999 0.999 0.0449 4089. 0 142 1486. -2683. -2002.
## 2 0.998 0.998 0.0521 6223. 0 69 1318. -2494. -2158.
## 3 0.998 0.998 0.0577 5477. 0 64 1231. -2330. -2018.
## # ℹ 4 more variables: deviance <dbl>, df.residual <int>, nobs <int>,
## # mod_name <chr>
# Filter the data to select the top 3 models with the least AIC values
top_3_least_AIC <- all_metrics %>%
arrange(AIC) %>%
head(3)
# View the top 3 models with least AIC values
top_3_least_AIC
## # A tibble: 3 × 13
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.999 0.999 0.0449 4089. 0 142 1486. -2683. -2002.
## 2 0.998 0.998 0.0521 6223. 0 69 1318. -2494. -2158.
## 3 0.998 0.998 0.0577 5477. 0 64 1231. -2330. -2018.
## # ℹ 4 more variables: deviance <dbl>, df.residual <int>, nobs <int>,
## # mod_name <chr>
top_3_least_BIC <- all_metrics %>%
arrange(BIC) %>%
head(3)
# View the top 3 models with least BIC values
top_3_least_BIC
## # A tibble: 3 × 13
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.998 0.998 0.0521 6223. 0 69 1318. -2494. -2158.
## 2 0.997 0.997 0.0666 21893. 0 12 1083. -2139. -2073.
## 3 0.998 0.998 0.0577 5477. 0 64 1231. -2330. -2018.
## # ℹ 4 more variables: deviance <dbl>, df.residual <int>, nobs <int>,
## # mod_name <chr>
We are focusing on R-squared, AIC, and BIC. Those three metrics are visualized in a single graphic in the figure below. The x-axis is the model name displayed as an integer. The facets correspond to the metrics, with AIC on the left, BIC in the middle, and R-squared displayed in the right facet.
all_metrics %>%
select(mod_name, df, r.squared, AIC, BIC) %>%
pivot_longer(!c("mod_name", "df")) %>%
ggplot(mapping = aes(x = mod_name, y = value)) +
geom_point(size = 1) +
facet_wrap(~name, scales = "free_y") +
theme_bw()
Despite model 7 having the highest training set \(R^2\), indicating better performance on the training data, AIC and BIC penalize models based on the number of parameters they estimate. Lower AIC and BIC values are preferred, as they indicate a better model fit with fewer parameters.
Although model 7 appears to perform the best on the training set, the penalty term in BIC suggests that the added flexibility introduced by this model may not be worth it. Model 5 ranks third under both AIC and BIC, striking a reasonable balance between model complexity and fit. Therefore, despite model 10 having a lower training \(R^2\) than model 7, its simpler structure may lead to better generalization to new data, whereas model 7 is prone to overfitting due to its greater complexity.

#### Saving the model
mod10 %>% readr::write_rds("mod10.rds")
Visualize the coefficient summaries for your top 3 models according to BIC and coefficient summaries comparison between the top 3 models
The top 3 models according to BIC are models 10, 9, and 5.
# mod5 coefficient summaries
mod5 %>% coefplot::coefplot() +
theme_bw() +
theme(legend.position = 'none')
mod5 %>% coef() %>% length()
## [1] 65
As the coefficient plot above shows, 11 features are statistically significant: none of their 95% confidence intervals contain zero. However, there are 65 coefficients in total and only 11 of them are significant. The significant inputs are Lightness pale, Lightness light, Lightness soft, Lightness midtone, Saturation neutral, Saturation gray, Saturation shaded, Saturation subdued, and Saturation muted.
# mod9 coefficient summaries
mod9 %>% coefplot::coefplot() +
theme_bw() +
theme(legend.position = 'none')
mod9 %>% coef() %>% length()
## [1] 13
What I inferred from the coefficient summary is that:
The statistically significant coefficients on the natural spline terms for R and G suggest a significant relationship between the red (R) and green (G) color components and the response variable. This indicates that changes in the intensity of red and green may have a notable impact on the response variable, while changes in blue (B) and Hue may not be as influential in this model.
Overall, the model suggests that variations in R and G values are more closely associated with changes in the response variable than variations in B and Hue values.
# mod10 coefficient summaries
mod10 %>% coefplot::coefplot() +
theme_bw() +
theme(legend.position = 'none')
mod10 %>% coef() %>% length()
## [1] 70
The coefficient summary plot above reveals a large number of features in the model. It does not appear that any of the features are statistically significant. The mod10 model has many more features than the number of inputs in our data set. This is a reminder that we can derive (calculate) as many features as we want from the inputs: the number of features of a linear model does not have to equal the number of inputs!
The mod10 model has 69 features plus the intercept, so a total of 70 coefficients needed to be estimated. Including the spline basis functions and their interactions with Lightness has completely changed the interpretation of significant features! The linear main-effect features are no longer statistically significant, and neither are the interactions between the inputs.
Inputs with significant coefficients are considered important predictors. The inputs with significant coefficients in models 5 and 9 are Lightness pale, Lightness light, Lightness soft, Lightness midtone, Saturation neutral, Saturation gray, Saturation shaded, Saturation subdued, Saturation muted, R, G, and B.
According to the BIC values computed in the previous part (Part 2), the best models are models 10 and 9.
Model 10 considers interactions between basis functions (natural splines for R, G, and B) and a categorical input variable, Lightness. This model captures the combined effect of the spline basis functions with the Lightness variable, allowing for a more nuanced representation of the relationship between the predictor variables and the response.
Model 10: Model considering interactions of basis functions with other basis functions and categorical inputs.
mod10 <- lm(y ~ (ns(R, df = 3) + ns(G, df = 3) + ns(B, df = 3)) * Lightness, data = train_dataset)
Model 9 utilizes natural cubic splines for the continuous variables R, G, B, and Hue. Natural cubic splines are a type of non-linear basis function used to capture complex relationships between the predictor variables and the response variable. By specifying the degrees of freedom (df = 3), the flexibility of the spline functions is controlled, allowing them to adapt to the underlying data patterns without overfitting.
# Using natural cubic splines for continuous variables
mod9 <- lm(y ~ ns(R, df = 3) + ns(G, df = 3) + ns(B, df = 3) + ns(Hue, df = 3), data = train_dataset)
These models are selected as the top-performing models according to the Bayesian Information Criterion (BIC) because they strike a balance between model complexity and goodness of fit. BIC penalizes models based on the number of parameters they estimate, favoring simpler models that achieve a good fit to the data. Models 9 and 10 likely achieve this balance by capturing the non-linear relationships and interactions in the data while avoiding excessive complexity that could lead to overfitting.
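As a concrete illustration of the basis expansion (using synthetic x values, not the paint data), ns() turns one continuous input into df derived feature columns:

```r
# ns() expands a single input into df basis columns; with df = 3, each
# continuous input contributes 3 columns to the design matrix.
library(splines)

x <- seq(0, 255, length.out = 50)   # synthetic input on the RGB scale
basis <- ns(x, df = 3)
dim(basis)                          # 50 rows, 3 basis columns
```

This is why mod9 has 4 inputs × 3 basis columns + 1 intercept = 13 coefficients, matching the coef() count reported above.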
I will perform the Bayesian analysis using the Laplace Approximation, fitting the Bayesian linear model by programming the log-posterior function. I would like to understand how the model behaves with a weak prior and with a strong prior.
Before proceeding, we’ll need to compile a list of essential details. This compilation should encompass the observed response, the design matrix, and the prior specification. In this setup, Gaussian priors, independent for each regression parameter, with a common mean and standard deviation, will be applied. Additionally, an Exponential prior will be utilized for the unknown likelihood noise (the σ parameter). Once this information is gathered, you can proceed to define the log-posterior function, mirroring the approach taken in previous assignments.
Create the design matrix following mod09’s formula and assign the object to the X09 variable. Complete the info_09_weak list by assigning the response to yobs and the design matrix to design_matrix. The shared prior mean, mu_beta, is 0; the shared prior standard deviation, tau_beta, is 50; and the rate parameter on the noise, sigma_rate, is 1.
# Create design matrix following mod09’s formula
X09 <- model.matrix(y ~ ns(R, df = 3) + ns(G, df = 3) + ns(B, df = 3) + ns(Hue, df = 3), data = train_dataset)
# Complete the info_09_weak list
info_09_weak <- list(
yobs = train_dataset$y,
design_matrix = X09,
mu_beta = 0,
tau_beta = 50,
sigma_rate = 1
)
The second code chunk will create the design matrix associated with mod10’s formula and assign the object to the X10 variable. Assign X10 to the design_matrix field of the info_10_weak list.
# Create design matrix following mod10's formula
X10 <- model.matrix(y ~ (ns(R, df = 3) + ns(G, df = 3) + ns(B, df = 3)) * Lightness, data = train_dataset)
# Complete the info_10_weak list
info_10_weak <- list(
yobs = train_dataset$y,
design_matrix = X10,
mu_beta = 0,
tau_beta = 50,
sigma_rate = 1
)
You will use the log-transformation on σ, and define the log-posterior in terms of the mean trend β-parameters and the unbounded noise parameter, φ=log[σ].
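The change-of-variables adjustment that this transformation requires follows directly from transforming the Exponential prior density on σ to the unbounded φ scale:

```latex
p(\varphi) = p_{\sigma}\!\left(\exp(\varphi)\right)\left|\frac{d\sigma}{d\varphi}\right|
           = p_{\sigma}\!\left(\exp(\varphi)\right)\exp(\varphi)
```

On the log scale we therefore simply add φ to the log-prior evaluated at σ = exp(φ); this is the log_derive_adjust term in the function below.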
The unknown parameters to learn are contained within the first input argument, unknowns. The assumption is that the unknown β-parameters are listed before the unknown φ parameter in the unknowns vector. You must specify the number of β parameters programmatically to allow scaling up your function to an arbitrary number of unknowns. You will assume that all variables contained in the my_info list (the second argument to lm_logpost()) are the same fields in the info_09_weak list.
lm_logpost <- function(unknowns, my_info)
{
# specify the number of unknown beta parameters
length_beta <- ncol(my_info$design_matrix)
# extract the beta parameters from the `unknowns` vector
beta_v <- unknowns[1:length_beta]
# extract the unbounded noise parameter, varphi
lik_varphi <- unknowns[length_beta + 1]
# back-transform from varphi to sigma
lik_sigma <- exp(lik_varphi)
# extract design matrix
X <- my_info$design_matrix
# calculate the linear predictor
mu <- as.vector( X %*% as.matrix(beta_v) )
# evaluate the log-likelihood
log_lik <- sum(dnorm(x = my_info$yobs,
mean = mu,
sd = lik_sigma,
log = TRUE))
# evaluate the log-prior
log_prior_beta <- sum(dnorm(x = beta_v,
mean = my_info$mu_beta,
sd = my_info$tau_beta,
log = TRUE))
log_prior_sigma <- dexp(x = lik_sigma,
rate = my_info$sigma_rate,
log = TRUE)
# add the mean trend prior and noise prior together
log_prior <- log_prior_beta + log_prior_sigma
# account for the transformation
log_derive_adjust <- lik_varphi
# sum together
log_lik + log_prior + log_derive_adjust
}
This function executes the Laplace Approximation and returns an object consisting of the posterior mode, the posterior covariance matrix, and the log-evidence.
my_laplace <- function(start_guess, logpost_func, ...)
{
# code adapted from the `LearnBayes` function `laplace()`
fit <- optim(start_guess,
logpost_func,
gr = NULL,
...,
method = "BFGS",
hessian = TRUE,
control = list(fnscale = -1, maxit = 1001))
mode <- fit$par
post_var_matrix <- -solve(fit$hessian)
p <- length(mode)
int <- p/2 * log(2 * pi) + 0.5 * log(det(post_var_matrix)) + logpost_func(mode, ...)
# package all of the results into a list
list(mode = mode,
var_matrix = post_var_matrix,
log_evidence = int,
converge = ifelse(fit$convergence == 0,
"YES",
"NO"),
iter_counts = as.numeric(fit$counts[1]))
}
Assign the model 10 result to the laplace_10_weak object, and the model 9 result to the laplace_09_weak object.
laplace_10_weak <- my_laplace(rep(0, ncol(X10)+1), lm_logpost, info_10_weak)
laplace_09_weak <- my_laplace(rep(0, ncol(X09)+1), lm_logpost, info_09_weak)
As shown below, the optimization schemes for both models converged.
laplace_10_weak$converge
## [1] "YES"
laplace_09_weak$converge
## [1] "YES"
The first argument is the vector of posterior means, and the second argument is the vector of posterior standard deviations. The third argument is the name of the feature associated with each coefficient.
viz_post_coefs <- function(post_means, post_sds, xnames)
{
tibble::tibble(
mu = post_means,
sd = post_sds,
x = xnames
) %>%
mutate(x = factor(x, levels = xnames)) %>%
ggplot(mapping = aes(x = x)) +
geom_hline(yintercept = 0, color = 'grey', linetype = 'dashed') +
geom_point(mapping = aes(y = mu)) +
geom_linerange(mapping = aes(ymin = mu - 2 * sd,
ymax = mu + 2 * sd,
group = x)) +
labs(x = 'feature', y = 'coefficient value') +
coord_flip() +
theme_bw()
}
The function needs the posterior means and standard deviations of the regression coefficients (the β parameters).
The coefficient posterior summaries for mod10 are visualized below. Notice that the number of columns in the X10 design matrix is used to identify the number of regression coefficients (beta parameters).
viz_post_coefs(laplace_10_weak$mode[1:ncol(X10)],
sqrt(diag(laplace_10_weak$var_matrix)[1:ncol(X10)]),
colnames(X10))
Likewise, the posterior coefficient summaries for mod09 are shown below. Again the number of columns in the design matrix is used to identify the regression coefficients.
viz_post_coefs(laplace_09_weak$mode[1:ncol(X09)],
sqrt(diag(laplace_09_weak$var_matrix)[1:ncol(X09)]),
colnames(X09))
The Bayes Factor is computed on the log scale, as the exponentiated difference of the log-evidence values.
exp( laplace_09_weak$log_evidence - laplace_10_weak$log_evidence )
## [1] 6.259728e+47
Conclusion: The best performing model with weak prior is Model 09.
We now try a more informative, strong prior by reducing the prior standard deviation on the regression coefficients from 50 to 1. The prior mean will still be zero.
The code block below defines the list of required information for both the model 10 and model 9 formulations using the strong prior on the regression coefficients. All other information, the data and the σ prior, is the same as for the weak prior.
Define the lists of required information for the strong prior.
info_10_strong <- list(
yobs = train_dataset$y,
design_matrix = X10,
mu_beta = 0,
tau_beta = 1,
sigma_rate = 1
)
info_09_strong <- list(
yobs = train_dataset$y,
design_matrix = X09,
mu_beta = 0,
tau_beta = 1,
sigma_rate = 1
)
Execute the Laplace Approximation.
laplace_10_strong <- my_laplace(rep(0, ncol(X10)+1), lm_logpost, info_10_strong)
laplace_09_strong <- my_laplace(rep(0, ncol(X09)+1), lm_logpost, info_09_strong)
As shown below, the optimizations did converge for both models.
purrr::map_chr(list(laplace_10_strong, laplace_09_strong), "converge")
## [1] "YES" "YES"
Model 10:
viz_post_coefs(laplace_10_strong$mode[1:ncol(X10)],
sqrt(diag(laplace_10_strong$var_matrix)[1:ncol(X10)]),
colnames(X10))
I think that coefficient posteriors are still very uncertain for mod10.
However, it is important to note the x-axis scale in the figure. The
figure below has all coefficient posterior means between -3 and +4. The
posterior coefficient summaries for mod10 associated with the weak prior
covered a much wider range (-50 to 50). Many of the coefficients had
posterior means in the double digits when we used the weak prior. The
more restrictive “strong” prior with prior standard deviation of 1 is
thus preventing the coefficients from reaching the values supported by
the data alone for mod10.
Model 9:
viz_post_coefs(laplace_09_strong$mode[1:ncol(X09)],
sqrt(diag(laplace_09_strong$var_matrix)[1:ncol(X09)]),
colnames(X09))
The posterior coefficient summaries for mod09 appear to be basically the same as what we saw with the weak prior. Our stronger prior is therefore restricting the coefficients to the range the data (likelihood) was supporting anyway. Thus, the posterior for this particular model and data is not influenced by this specific prior.
The Bayes Factor is computed on the log scale, as the exponentiated difference of the log-evidence values.
exp( laplace_09_strong$log_evidence - laplace_10_strong$log_evidence )
## [1] 4.892189e-37
This Bayes Factor is far below one, which reveals that mod10 is the better of the two models under the strong prior: the evidence overwhelmingly favors mod10 over mod09.
Thus, the Bayes Factor favors model 9 under the weak prior but model 10 under the strong prior.
As performance metrics I would also like to consider the AIC and BIC values. They are computed below:
extract_metrics <- function(mod, mod_name)
{
broom::glance(mod) %>% mutate(mod_name = mod_name)
}
all_bayes_metrics <- purrr::map2_dfr(list(mod9, mod10),
as.character(9:10),
extract_metrics)
all_bayes_metrics %>% glimpse()
## Rows: 2
## Columns: 13
## $ r.squared <dbl> 0.9968809, 0.9982216
## $ adj.r.squared <dbl> 0.9968354, 0.9980612
## $ sigma <dbl> 0.06662507, 0.05214884
## $ statistic <dbl> 21893.091, 6223.131
## $ p.value <dbl> 0, 0
## $ df <dbl> 12, 69
## $ logLik <dbl> 1083.481, 1318.041
## $ AIC <dbl> -2138.961, -2494.083
## $ BIC <dbl> -2072.777, -2158.435
## $ deviance <dbl> 3.648776, 2.080419
## $ df.residual <int> 822, 765
## $ nobs <int> 835, 835
## $ mod_name <chr> "9", "10"
According to both AIC and BIC, mod10 is the best performing model.
BIC penalizes model complexity more heavily than AIC, leading to a preference for simpler models. Both are likelihood-based criteria; BIC in particular can be viewed as an approximation to the log marginal likelihood (the log-evidence). Simpler models might generalize better for prediction, while more complex models might offer deeper insights into the underlying data generating process.
laplace_09_strong %>% readr::write_rds("laplace_09_strong.rds")
laplace_10_strong %>% readr::write_rds("laplace_10_strong.rds")
The coefficient posterior summaries based on the strong prior and the weak prior for mod10 are visualized below. At first glance, we may think the coefficient posteriors are still very uncertain for mod10. However, it is important to note the x-axis scale in each figure. With the strong prior, all coefficient posterior means lie between -2 and +4, while the summaries associated with the weak prior covered a much wider range, with many coefficients having posterior means in the double digits. The more restrictive “strong” prior, with a prior standard deviation of 1, is thus preventing the coefficients from reaching the values supported by the data alone for mod10.
viz_post_coefs(laplace_10_strong$mode[1:ncol(X10)],
sqrt(diag(laplace_10_strong$var_matrix)[1:ncol(X10)]),
colnames(X10))
viz_post_coefs(laplace_10_weak$mode[1:ncol(X10)],
sqrt(diag(laplace_10_weak$var_matrix)[1:ncol(X10)]),
colnames(X10))
The maximum likelihood estimate (MLE) of 𝜎 obtained from the lm() function represents a point estimate of the standard deviation parameter based on the observed data. It is derived by maximizing the likelihood function, which measures the probability of observing the data given a particular value of 𝜎.
In contrast, the posterior uncertainty on 𝜎, as obtained from Bayesian inference, reflects the distribution of possible values of 𝜎 given both the observed data and the prior information. This uncertainty is captured by the posterior distribution, which accounts for the entire range of plausible values of 𝜎 along with their respective probabilities.
The relationship between the MLE and the posterior uncertainty on 𝜎 depends on the specific prior chosen for the Bayesian analysis. A more informative prior, such as a strong prior with a smaller standard deviation, can constrain the posterior distribution, leading to reduced uncertainty compared to the MLE. Conversely, a less informative prior, such as a weak prior with a larger standard deviation, may result in a broader posterior distribution with greater uncertainty.
Let’s keep things simple, as we did in the assignments, with an assumed value of σ = 1. You can also think of this as examining the scaled uncertainty. We will begin by assuming that the prior does not matter, i.e., an infinitely diffuse prior (a prior with infinite uncertainty).
# Assuming σ = 1
sigma_squared <- 1
# Compute the posterior covariance matrix for mod10
post_cov_matrix_mod10 <- sigma_squared * solve(t(X10) %*% X10)
print("Number of rows and columns in mod10")
## [1] "Number of rows and columns in mod10"
print(post_cov_matrix_mod10 %>% dim())
## [1] 70 70
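As a cross-check of this formula (on illustrative built-in data, not the paint set), plugging lm()'s own residual-based σ estimate into σ²(XᵀX)⁻¹ reproduces the coefficient covariance matrix that vcov() reports:

```r
# sigma^2 * (X'X)^{-1} is the coefficient covariance matrix; with lm()'s
# sigma estimate swapped in, it matches vcov() exactly.
fit <- lm(mpg ~ wt + hp, data = mtcars)
X   <- model.matrix(fit)
s2  <- summary(fit)$sigma^2

manual_cov <- s2 * solve(t(X) %*% X)
```

The matrix computed above for mod10 is the same object, just with σ fixed at 1 instead of estimated from the residuals.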
Do you feel the posterior is precise or are we quite uncertain about 𝜎?
The coefficient posterior summaries for mod10 associated with the weak prior covered a much wider range, with many coefficients having posterior means in the double digits. This indicates substantial uncertainty in the coefficient estimates.
The strong prior, with a prior standard deviation of 1, restricts the coefficients from reaching the values supported by the data alone, so the posterior remains sensitive to the choice of prior.
In Bayesian analysis, uncertainty is captured by the posterior distribution, which describes the range of plausible values of σ along with their respective probabilities. The wide posterior distributions seen here suggest we are quite uncertain about σ.
It is now time to examine the predictive trends of the models to better interpret their behavior. I will visualize the trends on a specifically designed prediction grid. The code chunk below defines that grid for you using the expand.grid() function.
# Define unique values for Lightness and Saturation
lightness_values <- c("dark", "deep", "light", "midtone", "pale", "saturated", "soft")
saturation_values <- c("bright", "gray", "muted", "neutral", "pure", "subdued", "shaded")
# Generate the grid
viz_grid <- expand.grid(
R = sample(0:255, 2), # Adjust the number of values sampled as needed
G = sample(0:255, 2), # Adjust the number of values sampled as needed
B = sample(0:255, 2), # Adjust the number of values sampled as needed
Lightness = lightness_values,
Saturation = saturation_values,
Hue = sample(0:36, 2)
)
# Generate random integers for R, G, B for each row
viz_grid$R <- sample(0:255, nrow(viz_grid), replace = TRUE)
viz_grid$G <- sample(0:255, nrow(viz_grid), replace = TRUE)
viz_grid$B <- sample(0:255, nrow(viz_grid), replace = TRUE)
viz_grid$Hue <- sample(0:36, nrow(viz_grid), replace = TRUE)
# Display the structure of viz_grid
glimpse(viz_grid)
## Rows: 784
## Columns: 6
## $ R <int> 47, 216, 134, 205, 55, 51, 76, 144, 27, 254, 18, 233, 156, …
## $ G <int> 193, 172, 58, 125, 32, 94, 50, 106, 38, 61, 168, 184, 255, …
## $ B <int> 179, 34, 173, 210, 103, 38, 134, 42, 100, 253, 177, 243, 2,…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 2, 23, 1, 27, 13, 31, 7, 15, 26, 21, 12, 5, 15, 16, 30, 21,…
viz_grid %>% readr::write_rds("viz_grid.rds")
Posterior samples are generated, and those samples are used to calculate the posterior samples of the mean trend and to generate random posterior samples of the response around the mean. The glimpse provided above shows there are 784 combinations of the 6 inputs. We will therefore make over 700 predictions to study the trends of the predicted response!
We will now summarize the posterior predictions of the mean and of the random response, making predictions for each of the models and visualizing their trends. A function, tidy_predict(), is created which assembles the predicted mean trend, the confidence interval, and the prediction interval into a tibble. The result includes the input values to streamline making the visualizations.
tidy_predict <- function(mod, xnew)
{
pred_df <- predict(mod, xnew, interval = "confidence") %>%
as.data.frame() %>% tibble::as_tibble() %>%
dplyr::select(pred = fit, ci_lwr = lwr, ci_upr = upr) %>%
bind_cols(predict(mod, xnew, interval = 'prediction') %>%
as.data.frame() %>% tibble::as_tibble() %>%
dplyr::select(pred_lwr = lwr, pred_upr = upr))
xnew %>% bind_cols(pred_df)
}
The first argument to the tidy_predict() function is an lm() model object and the second argument is a new or test dataframe of inputs. When working with lm() and its predict() method, the functions create the test design matrix consistent with the training design matrix. They do so via the model formula, which is contained within the lm() model object.
We make predictions with the models selected from Part 1. I will focus on the linear non-Bayesian models mod9 and mod10, using the visualization grid, viz_grid. Their predictions are assigned to the variables pred_lm_09 and pred_lm_10.
Predictions are made in the code chunk below.
pred_lm_02 <- tidy_predict(mod2, viz_grid)
pred_lm_03 <- tidy_predict(mod3, viz_grid)
pred_lm_04 <- tidy_predict(mod4, viz_grid)
pred_lm_05 <- tidy_predict(mod5, viz_grid)
pred_lm_06 <- tidy_predict(mod6, viz_grid)
pred_lm_07 <- tidy_predict(mod7, viz_grid)
pred_lm_08 <- tidy_predict(mod8, viz_grid)
pred_lm_09 <- tidy_predict(mod9, viz_grid)
pred_lm_10 <- tidy_predict(mod10, viz_grid)
A glimpse of the pred_lm_09 tibble is shown below.
pred_lm_09 %>% glimpse()
## Rows: 784
## Columns: 11
## $ R <int> 47, 216, 134, 205, 55, 51, 76, 144, 27, 254, 18, 233, 156, …
## $ G <int> 193, 172, 58, 125, 32, 94, 50, 106, 38, 61, 168, 184, 255, …
## $ B <int> 179, 34, 173, 210, 103, 38, 134, 42, 100, 253, 177, 243, 2,…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 2, 23, 1, 27, 13, 31, 7, 15, 26, 21, 12, 5, 15, 16, 30, 21,…
## $ pred <dbl> -0.4862815, -0.3834188, -2.3197105, -0.6999442, -3.2458863,…
## $ ci_lwr <dbl> -0.5246634, -0.4217871, -2.3537547, -0.7382088, -3.2908182,…
## $ ci_upr <dbl> -0.4478996, -0.3450505, -2.2856663, -0.6616795, -3.2009544,…
## $ pred_lwr <dbl> -0.6225729, -0.5197063, -2.4548444, -0.8362026, -3.3841652,…
## $ pred_upr <dbl> -0.3499901, -0.2471312, -2.1845765, -0.5636857, -3.1076074,…
A glimpse of the pred_lm_10 tibble is shown below.
pred_lm_10 %>% glimpse()
## Rows: 784
## Columns: 11
## $ R <int> 47, 216, 134, 205, 55, 51, 76, 144, 27, 254, 18, 233, 156, …
## $ G <int> 193, 172, 58, 125, 32, 94, 50, 106, 38, 61, 168, 184, 255, …
## $ B <int> 179, 34, 173, 210, 103, 38, 134, 42, 100, 253, 177, 243, 2,…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 2, 23, 1, 27, 13, 31, 7, 15, 26, 21, 12, 5, 15, 16, 30, 21,…
## $ pred <dbl> -0.70990173, -0.18483048, -2.33175595, -2.00457175, -3.1859…
## $ ci_lwr <dbl> -2.0182998, -0.2407187, -3.1791854, -8.1995298, -3.2319970,…
## $ ci_upr <dbl> 0.5984964, -0.1289422, -1.4843264, 4.1903863, -3.1398182, -…
## $ pred_lwr <dbl> -2.0222986, -0.3014645, -3.1853465, -8.2003756, -3.2981762,…
## $ pred_upr <dbl> 0.6024951, -0.0681965, -1.4781654, 4.1912321, -3.0736391, -…
The pred column of each pred_lm_ object is the predicted mean trend. The ci_lwr and ci_upr columns are the lower and upper bounds of the confidence interval, respectively. The pred_lwr and pred_upr columns are the lower and upper bounds of the prediction interval, respectively.
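As a quick sanity check on these intervals, here is a toy sketch on the built-in mtcars data (not the project data): the prediction interval always contains the confidence interval, because it adds the residual noise variance on top of the mean-trend uncertainty.

```r
# Toy check on the built-in mtcars data: the prediction interval is always
# at least as wide as the confidence interval at the same input, since it
# accounts for the noise around the mean trend as well
fit_toy <- lm(mpg ~ wt, data = mtcars)
new_pts <- data.frame(wt = c(2, 3, 4))

ci <- predict(fit_toy, new_pts, interval = "confidence")
pi <- predict(fit_toy, new_pts, interval = "prediction")

all(pi[, "lwr"] <= ci[, "lwr"] & ci[, "upr"] <= pi[, "upr"])  # TRUE
```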
You will use ggplot() to visualize the predictions. You will use geom_line() to visualize the mean trend and geom_ribbon() to visualize the uncertainty intervals.
Visualize the predictions of each model on the visualization grid. Pipe the pred_lm_ object to ggplot() and map the primary input variable to the x aesthetic.
The predictions from mod2 and mod3 are shown below. I wanted to visualize how the simpler models without complex interactions behave.
pred_lm_02 %>%
ggplot(mapping = aes(x = G)) +
geom_ribbon(mapping = aes(ymin = pred_lwr, ymax = pred_upr), fill = 'orange') +
geom_line(mapping = aes(y = pred), color = 'black') +
coord_cartesian(ylim = c(-3, 3)) + # Adjust the limits as needed
facet_wrap(~Saturation, scales = "free", labeller = "label_both") + # Using 'Saturation' as the facet variable
theme_bw()
pred_lm_03 %>%
ggplot(mapping = aes(x = G)) +
geom_ribbon(mapping = aes(ymin = pred_lwr, ymax = pred_upr), fill = 'orange') +
geom_line(mapping = aes(y = pred), color = 'black') +
coord_cartesian(ylim = c(-3, 3)) + # Adjust the limits as needed
facet_wrap(~Saturation, scales = "free", labeller = "label_both") + # Using 'Saturation' as the facet variable
theme_bw()
The predictive trends of the 2 selected linear models are consistent with each other, and the predictions are consistent with the coefficient summaries we looked at previously.
library(ggplot2)
# Reference values for the remaining numeric inputs (the means over the
# visualization grid); the categorical inputs are handled by the facets
reference_values <- data.frame(
  R = mean(pred_lm_09$R),
  G = mean(pred_lm_09$G),
  B = mean(pred_lm_09$B),
  Hue = mean(pred_lm_09$Hue)
)
# Plot predictive trends for G and Saturation
pred_lm_09 %>%
ggplot(aes(x = G)) +
geom_ribbon(aes(ymin = pred_lwr, ymax = pred_upr),
fill = 'orange') +
geom_ribbon(aes(ymin = ci_lwr, ymax = ci_upr),
fill = 'grey') +
geom_line(aes(y = pred),
color = 'black') +
coord_cartesian(ylim = c(min(pred_lm_09$ci_lwr), max(pred_lm_09$ci_upr))) +
facet_wrap(~ Saturation, scales = "free", labeller = "label_both") +
theme_bw() +
# Add reference lines for the remaining inputs
geom_vline(data = reference_values, aes(xintercept = Hue), linetype = "dashed", color = "pink") +
geom_vline(data = reference_values, aes(xintercept = R), linetype = "dashed", color = "green") +
geom_vline(data = reference_values, aes(xintercept = B), linetype = "dashed", color = "blue")
# Plot predictive trends for G and Lightness
pred_lm_09 %>%
ggplot(aes(x = G)) +
geom_ribbon(aes(ymin = pred_lwr, ymax = pred_upr),
fill = 'orange') +
geom_ribbon(aes(ymin = ci_lwr, ymax = ci_upr),
fill = 'grey') +
geom_line(aes(y = pred),
color = 'black') +
coord_cartesian(ylim = c(min(pred_lm_09$ci_lwr), max(pred_lm_09$ci_upr))) +
facet_wrap(~ Lightness, scales = "free", labeller = "label_both") +
theme_bw() +
geom_vline(data = reference_values, aes(xintercept = Hue), linetype = "dashed", color = "pink") +
geom_vline(data = reference_values, aes(xintercept = R), linetype = "dashed", color = "green") +
geom_vline(data = reference_values, aes(xintercept = B), linetype = "dashed", color = "blue")
- Hue did not show any statistical significance in the coefficient summaries. To show this, I have plotted the predictive trends and the confidence and prediction intervals for model 9 below, with Hue as the primary input. The graph shows that this behavior is still consistent in the predictive trend.
# Plot predictive trends for Hue and Lightness
pred_lm_09 %>%
ggplot(aes(x = Hue)) +
geom_ribbon(aes(ymin = pred_lwr, ymax = pred_upr),
fill = 'orange') +
geom_ribbon(aes(ymin = ci_lwr, ymax = ci_upr),
fill = 'grey') +
geom_line(aes(y = pred),
color = 'black') +
coord_cartesian(ylim = c(min(pred_lm_09$ci_lwr), max(pred_lm_09$ci_upr))) +
facet_wrap(~ Lightness, scales = "free", labeller = "label_both") +
theme_bw() +
geom_vline(data = reference_values, aes(xintercept = G), linetype = "dashed", color = "pink") +
geom_vline(data = reference_values, aes(xintercept = R), linetype = "dashed", color = "green") +
geom_vline(data = reference_values, aes(xintercept = B), linetype = "dashed", color = "blue")
#### Model 10:
We see the trend (the black line) become more and more “wiggly” and the confidence interval (grey ribbon) grow larger in model 10. Visually, the patterns of the two models seem consistent with each other. In particular, for the plot of G vs pred with Lightness as the secondary input, the levels deep, pale, saturated, and soft are very consistent with model 9.
pred_lm_10 %>%
ggplot(aes(x = G)) +
geom_ribbon(aes(ymin = pred_lwr, ymax = pred_upr),
fill = 'orange') +
geom_ribbon(aes(ymin = ci_lwr, ymax = ci_upr),
fill = 'grey') +
geom_line(aes(y = pred),
color = 'black') +
coord_cartesian(ylim = c(min(pred_lm_09$ci_lwr), max(pred_lm_09$ci_upr))) +
facet_wrap(~ Saturation, scales = "free", labeller = "label_both") +
theme_bw() +
# Add reference lines for the remaining inputs
geom_vline(data = reference_values, aes(xintercept = Hue), linetype = "dashed", color = "pink") +
geom_vline(data = reference_values, aes(xintercept = R), linetype = "dashed", color = "green") +
geom_vline(data = reference_values, aes(xintercept = B), linetype = "dashed", color = "blue")
# Plot predictive trends
pred_lm_10 %>%
ggplot(aes(x = G)) +
geom_ribbon(aes(ymin = pred_lwr, ymax = pred_upr),
fill = 'orange') +
geom_ribbon(aes(ymin = ci_lwr, ymax = ci_upr),
fill = 'grey') +
geom_line(aes(y = pred),
color = 'black') +
coord_cartesian(ylim = c(min(pred_lm_09$ci_lwr), max(pred_lm_09$ci_upr))) +
facet_wrap(~ Lightness, scales = "free", labeller = "label_both") +
theme_bw() +
geom_vline(data = reference_values, aes(xintercept = Hue), linetype = "dashed", color = "pink") +
geom_vline(data = reference_values, aes(xintercept = R), linetype = "dashed", color = "green") +
geom_vline(data = reference_values, aes(xintercept = B), linetype = "dashed", color = "blue")
### Plotting Hue vs pred with a secondary input (Lightness or Saturation) with model 10
Hue is not statistically significant, as we have seen in the coefficient plots.
pred_lm_10 %>%
ggplot(aes(x = Hue)) +
geom_ribbon(aes(ymin = pred_lwr, ymax = pred_upr),
fill = 'orange') +
geom_ribbon(aes(ymin = ci_lwr, ymax = ci_upr),
fill = 'grey') +
geom_line(aes(y = pred),
color = 'black') +
coord_cartesian(ylim = c(min(pred_lm_09$ci_lwr), max(pred_lm_09$ci_upr))) +
facet_wrap(~ Lightness, scales = "free", labeller = "label_both") +
theme_bw() +
# Add reference lines for the remaining inputs
geom_vline(data = reference_values, aes(xintercept = G), linetype = "dashed", color = "pink") +
geom_vline(data = reference_values, aes(xintercept = R), linetype = "dashed", color = "green") +
geom_vline(data = reference_values, aes(xintercept = B), linetype = "dashed", color = "blue")
Instructions: 1. You must train, assess, tune, and compare more complex methods via resampling. 2. You may use either caret or tidymodels to handle the preprocessing, training, testing, and evaluation.
The code chunk below imports the caret package
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
In the Data Exploration part of the project, the categorical values in the Lightness and Saturation columns were recoded as integers. So, all the input variables, including R, G, B, Hue, LightnessNum, and SaturationNum, are formatted to be consistent with the structure caret prefers.
train_dataset %>% glimpse()
## Rows: 835
## Columns: 15
## $ R <dbl> 172, 26, 172, 28, 170, 175, 90, 194, 171, 122, 0, 88, 14…
## $ G <dbl> 58, 88, 94, 87, 66, 89, 78, 106, 68, 151, 121, 140, 82, …
## $ B <dbl> 62, 151, 58, 152, 58, 65, 136, 53, 107, 59, 88, 58, 132,…
## $ Lightness <chr> "dark", "dark", "dark", "dark", "dark", "dark", "dark", …
## $ Saturation <chr> "bright", "bright", "bright", "bright", "bright", "brigh…
## $ Hue <dbl> 4, 31, 8, 32, 5, 6, 34, 10, 1, 21, 24, 22, 36, 16, 26, 1…
## $ response <dbl> 12, 10, 16, 10, 11, 16, 10, 19, 14, 25, 14, 19, 14, 38, …
## $ outcome <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ LightnessNum <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ SaturationNum <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,…
## $ y <dbl> -1.9924302, -2.1972246, -1.6582281, -2.1972246, -2.09074…
## $ logit_R <dbl> 0.72865387, -2.17562547, 0.72865387, -2.09274551, 0.6931…
## $ logit_G <dbl> -Inf, -1.62924054, -1.40691365, -1.66965677, -3.08534443…
## $ logit_B <dbl> -2.4624338, 0.1784828, -2.7950616, 0.1996131, -2.7950616…
## $ logit_Hue <dbl> -2.36712361, 1.79175947, -1.38629436, 2.04769284, -2.047…
Model 9: Model with non-linear basis functions based on EDA
library(splines)  # provides the ns() natural spline basis
mod9 <- lm(y ~ ns(R, df = 3) + ns(G, df = 3) + ns(B, df = 3) + ns(Hue, df = 3), data = train_dataset)
Model 10: Model considering interactions of basis functions with other basis functions and categorical inputs
mod10 <- lm(y ~ (ns(R, df = 3) + ns(G, df = 3) + ns(B, df = 3)) * Lightness, data = train_dataset)
Specify the resampling scheme that caret will use to train, assess, and tune a model.
The resampling scheme is specified by the trainControl() function in caret. The type of scheme is controlled by the method argument. For k-fold cross-validation, the method argument must equal ‘cv’ and the number of folds is controlled by the number argument. We could instruct caret to use repeated cross-validation by specifying method to be ‘repeatedcv’ and including the number of repeats via the repeats argument. However, we will follow the process consistent with lecture and use 5-fold cross-validation for this assignment.
Specify the resampling scheme:
my_ctrl <- trainControl(method = "repeatedcv", number = 5, savePredictions = 'final' )
I am using 5-fold cross-validation. Each model will be trained 5 times and tested 5 times, once per fold.
The caret package requires that we specify a primary performance metric of interest.
Selecting a primary performance metric to use to compare the models: we need a performance metric associated with linear regression, so we can choose RMSE, MAE, or Rsquared. We are using 5-fold cross-validation, so each fold’s holdout test set contains 20% of the training data.
my_metric <- 'RMSE'
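As a reminder of what RMSE measures, here is a toy sketch on the built-in mtcars data (not the project data): the square root of the mean squared error, which is what caret averages across the resampling folds.

```r
# Toy reminder on the built-in mtcars data: RMSE is the square root of the
# mean squared error between observations and predictions
fit_toy <- lm(mpg ~ wt, data = mtcars)

rmse_by_hand <- sqrt(mean(residuals(fit_toy)^2))
rmse_by_hand
# modelr::rmse(fit_toy, mtcars) computes the same value; modelr was loaded earlier
```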
set.seed(2001)
mod_lmc_4 <- train(y ~ Lightness + Saturation + R + G + B + Hue,
data = train_dataset,
method = "lm",
metric = my_metric,
trControl = my_ctrl)
mod_lmc_4
## Linear Regression
##
## 835 samples
## 6 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 668, 668, 669, 667, 668
## Resampling results:
##
## RMSE Rsquared MAE
## 0.08961727 0.9943377 0.06734552
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
The default model has an RMSE of 0.0896.
set.seed(2001)
mod_lmc_6 <- train(y ~ (R + G + B + Hue)^2 + Lightness + Saturation,
data = train_dataset,
method = "lm",
metric = my_metric,
trControl = my_ctrl)
mod_lmc_6
## Linear Regression
##
## 835 samples
## 6 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 668, 668, 669, 667, 668
## Resampling results:
##
## RMSE Rsquared MAE
## 0.08053766 0.9954601 0.06089084
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
set.seed(2001)
mod_lmc_9 <- train(y ~ ns(R, df = 3) + ns(G, df = 3) + ns(B, df = 3) + ns(Hue, df = 3),
data = train_dataset,
method = "lm",
metric = my_metric,
trControl = my_ctrl)
mod_lmc_9
## Linear Regression
##
## 835 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 668, 668, 669, 667, 668
## Resampling results:
##
## RMSE Rsquared MAE
## 0.06751746 0.9967728 0.04612248
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
set.seed(2001)
mod_lmc_10 <- train(y ~ (ns(R, df = 3) + ns(G, df = 3) + ns(B, df = 3)) * Lightness,
data = train_dataset,
method = "lm",
metric = my_metric,
trControl = my_ctrl)
mod_lmc_10
## Linear Regression
##
## 835 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 668, 668, 669, 667, 668
## Resampling results:
##
## RMSE Rsquared MAE
## 0.06381023 0.9969635 0.039008
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
The resamples() function collects the hold-out results from each model; the resulting traindataset_results_resampling object can be used to compare the models through tables and visualizations.
traindataset_results_resampling = resamples(list(
fit_04 = mod_lmc_4,
fit_06 = mod_lmc_6,
fit_09 = mod_lmc_9,
fit_10 = mod_lmc_10
))
The caret package provides default summary and plot methods which rank the models based on their resampling hold-out results. The summary() function prints a table like object which summarizes the resampling results. The dotplot() function creates a dot plot with confidence intervals on the resampling performance metrics.
traindataset_results_resampling |> summary(metric = 'RMSE')
##
## Call:
## summary.resamples(object = traindataset_results_resampling, metric = "RMSE")
##
## Models: fit_04, fit_06, fit_09, fit_10
## Number of resamples: 5
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## fit_04 0.08350267 0.08652551 0.08872158 0.08961727 0.09217503 0.09716155 0
## fit_06 0.07421857 0.07906394 0.07919018 0.08053766 0.08024848 0.08996715 0
## fit_09 0.05901126 0.06521073 0.06581390 0.06751746 0.06667169 0.08087970 0
## fit_10 0.05022380 0.05431282 0.05571196 0.06381023 0.06604841 0.09275416 0
The model “fit_10” has the lowest mean RMSE (0.06381023), making it the best model according to the 5-fold cross-validation scheme.
mod_lmc_10 %>% readr::write_rds("mod_lmc_10.rds")
Let’s also visualize the resampling results using the dotplot() function:
traindataset_results_resampling |> dotplot(metric = 'RMSE')
Now that we have trained linear models and identified the best model according to the RMSE performance metrics, we will now train Regularized regression with Elastic net.
Regularized regression techniques like Elastic Net (which combines L1 and L2 regularization) are commonly used in regression problems to prevent overfitting and improve the generalization ability of the model. When using Elastic Net, you typically tune hyperparameters like the mixing parameter (alpha) and the regularization parameter (lambda).
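As a small illustration of those two hyperparameters, here is a sketch on the built-in mtcars data using glmnet directly (it assumes the glmnet package is available; the project itself tunes these through caret):

```r
# Sketch of the elastic net hyperparameters on the built-in mtcars data
# (assumes the glmnet package is installed)
library(glmnet)

x <- model.matrix(mpg ~ ., data = mtcars)[, -1]  # drop the intercept column
y_toy <- mtcars$mpg

# alpha mixes the penalties: alpha = 1 is the lasso, alpha = 0 is ridge
fit_enet <- glmnet(x, y_toy, alpha = 0.5)

# lambda controls the overall regularization strength; glmnet fits a whole
# path of lambda values, shrinking more coefficients to zero as lambda grows
head(fit_enet$lambda)
```

caret's glmnet method tunes exactly these two quantities over the grid we define below.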
To evaluate the performance of an Elastic Net model, I will use RMSE as the evaluation metric. RMSE measures the average deviation of predicted values from actual values, and it’s a suitable metric for regression problems because it penalizes larger errors more heavily. Lower RMSE values indicate better model performance, where the predicted values are closer to the actual values.
Assign the method argument to ‘glmnet’ and set the metric argument to my_metric. Even though the inputs were standardized for you, you must also instruct caret to standardize the features by setting the preProcess argument equal to c(‘center’, ‘scale’). Assign the trControl argument to the my_ctrl object.
The caret::train() function works with the formula interface.
set.seed(1234)
enet_default_mod6 <- train( y ~ (R + G + B + Hue)^2 + Lightness + Saturation,
data = train_dataset,
method = 'glmnet',
metric = my_metric,
preProcess = c("center", "scale"),
trControl = my_ctrl)
enet_default_mod6
## glmnet
##
## 835 samples
## 6 predictor
##
## Pre-processing: centered (22), scaled (22)
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 668, 668, 668, 668, 668
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.10 0.002324946 0.08179611 0.9952651 0.06177323
## 0.10 0.023249462 0.08779076 0.9946092 0.06486433
## 0.10 0.232494618 0.14937783 0.9876432 0.11017555
## 0.55 0.002324946 0.08383331 0.9950262 0.06251216
## 0.55 0.023249462 0.09643629 0.9935995 0.06990620
## 0.55 0.232494618 0.20085400 0.9893951 0.14960444
## 1.00 0.002324946 0.08446765 0.9949473 0.06263570
## 1.00 0.023249462 0.10221982 0.9929677 0.07338187
## 1.00 0.232494618 0.26695708 0.9892905 0.20873090
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.002324946.
Create a tuning grid with the expand.grid() function which has two columns named alpha and lambda. The alpha variable should be evenly spaced between 0.1 and 1.0 by increments of 0.1. The lambda variable should have 25 evenly spaced values in the log-space between the minimum and maximum lambda values from the caret default tuning grid. Assign the tuning grid to the enet_grid object.
The candidate lambda values are created in the code chunk below. The seq() function creates a sequence of evenly spaced values between the log of the min and the log of the max values tried by the default grid. The log-space makes it easier to span several orders of magnitude. The exp() function then converts the values back to the lambda space.
my_lambda_grid_06 <- exp(seq(log(min(enet_default_mod6$results$lambda)),
log(max(enet_default_mod6$results$lambda)),
length.out = 25))
The enet_grid_06 is defined below by combining the my_lambda_grid_06 with a grid of alpha or mixing fraction values between 0.1 and 1.0.
enet_grid_06 <- expand.grid(alpha = seq(0.1, 1.0, by = 0.1),
lambda = my_lambda_grid_06)
enet_grid_06 %>% dim()
## [1] 250 2
The number of combinations generated by expand.grid() can be found by checking the dimensions of enet_grid_06. As shown above, we will try out 250 tuning parameter combinations!
Now train, assess, and tune the elastic net model with the custom tuning grid and assign the result to the enet_tune_06 object.
set.seed(1234)
enet_tune_06 <- train(y ~ (R + G + B + Hue)^2 + Lightness + Saturation,
data = train_dataset,
method = 'glmnet',
metric = my_metric,
tuneGrid = enet_grid_06,
preProcess = c("center", "scale"),
trControl = my_ctrl)
plot(enet_tune_06, xTrans = log)
enet_tune_06$bestTune
## alpha lambda
## 26 0.2 0.002324946
The optimal tuning parameter values are printed above; the best alpha is 0.2. However, look closely at the resampled average RMSE values displayed in the above figure. Multiple tuning parameter combinations perform roughly the same as the overall best; the overall best average RMSE is only slightly better than the next best result. By default caret chooses the overall best, but perhaps “pure” lasso is within 1 standard error of the overall best tuning parameter values. Extracting the one-standard-error rule results is tricky to do in caret, and so we will follow the overall best recommendation for now.
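For reference, the one-standard-error rule can be sketched by hand against a caret-style results table. The pick_one_se() helper and the toy table below are hypothetical, but the RMSE and RMSESD column names match what caret stores in a train object's results.

```r
# Hand-rolled sketch of the one-standard-error rule on a generic caret-style
# results table; pick_one_se() and toy_results are hypothetical
pick_one_se <- function(results, n_resamples = 5) {
  best <- results[which.min(results$RMSE), ]
  threshold <- best$RMSE + best$RMSESD / sqrt(n_resamples)
  # among the models within one SE of the best, prefer the most regularized
  within_one_se <- results[results$RMSE <= threshold, ]
  within_one_se[which.max(within_one_se$lambda), ]
}

toy_results <- data.frame(lambda = c(0.001, 0.01, 0.1),
                          RMSE   = c(0.080, 0.082, 0.120),
                          RMSESD = c(0.005, 0.005, 0.005))

pick_one_se(toy_results)  # selects the lambda = 0.01 row
```

The rule trades a statistically negligible amount of performance for a simpler (more regularized) model.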
coef(enet_tune_06$finalModel, s = enet_tune_06$bestTune$lambda)
## 23 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -0.109359649
## R 0.184777485
## G 0.565321648
## B .
## Hue -0.016944750
## Lightnessdeep 0.020788290
## Lightnesslight -0.012144467
## Lightnessmidtone -0.024041176
## Lightnesspale 0.031750403
## Lightnesssaturated 0.003570312
## Lightnesssoft -0.027561446
## Saturationgray -0.020020043
## Saturationmuted -0.008547795
## Saturationneutral -0.024032891
## Saturationpure 0.015234729
## Saturationshaded -0.019120060
## Saturationsubdued -0.018949183
## R:G 0.209375662
## R:B .
## R:Hue -0.069517957
## G:B 0.276884905
## G:Hue 0.089941061
## B:Hue .
The coefficients are displayed in a sparse matrix format, where non-zero coefficients are explicitly shown, while zero coefficients are represented as ‘.’. Interpretations from the Sparse matrix:
The coefficients for the colors R and G indicate their respective contributions to the response variable (y) in the context of the Elastic Net model. As interpreted before, G contributes more than R and B towards the logit transformed response ‘y’. These coefficients indicate that both Red (R) and Green (G) colors have a positive impact on the predicted value of the response variable in the Elastic Net model.
The Blue color variable did not have a significant impact on the response variable (y) in the context of the Elastic Net model.
The negative coefficient of -0.016944750 for the variable “Hue” indicates its contribution to the predicted value of the response variable in the Elastic Net model. The negative coefficient suggests that higher values of “Hue” are associated with lower predicted values of the response variable: as the hue becomes more pronounced or shifts towards certain colors (depending on how “Hue” is represented), it tends to lead to lower predicted values for the response variable.
Lightnessdeep, Lightnesspale, Lightnesssaturated, Saturationpure indicate a positive impact on their respective contributions to the response variable (y) in the context of the Elastic Net model.
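The '.' entries above come from the sparse matrix display rather than from missing values; a minimal illustration with the Matrix package (which ships with standard R installations):

```r
# Minimal illustration of the sparse dgCMatrix display used by glmnet:
# exact-zero entries are printed as '.' instead of 0
library(Matrix)

m <- Matrix(c(1.5, 0, -0.3, 0), ncol = 1, sparse = TRUE)
m  # the zero entries print as '.'
```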
As a comparison, note that the default and tuned results are very much consistent: both identified a small mixing fraction (alpha = 0.1 from the default grid, alpha = 0.2 from the custom grid) with a small lambda, so relatively few features are turned off.
set.seed(1234)
enet_default_mod10 <- train( y ~ (ns(R, df = 3) + ns(G, df = 3) + ns(B, df = 3)) * Lightness,
data = train_dataset,
method = 'glmnet',
metric = my_metric,
preProcess = c("center", "scale"),
trControl = my_ctrl)
enet_default_mod10
## glmnet
##
## 835 samples
## 4 predictor
##
## Pre-processing: centered (69), scaled (69)
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 668, 668, 668, 668, 668
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.10 0.001924909 0.05778489 0.9975918 0.03907914
## 0.10 0.019249086 0.06908010 0.9966862 0.04629991
## 0.10 0.192490860 0.17708545 0.9846270 0.11650787
## 0.55 0.001924909 0.06052592 0.9973374 0.04052020
## 0.55 0.019249086 0.07492304 0.9964330 0.05027818
## 0.55 0.192490860 0.29382268 0.9707374 0.21106152
## 1.00 0.001924909 0.06228617 0.9971781 0.04157015
## 1.00 0.019249086 0.08951467 0.9951979 0.05920334
## 1.00 0.192490860 0.41675834 0.9369369 0.29809771
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.001924909.
my_lambda_grid_10 <- exp(seq(log(min(enet_default_mod10$results$lambda)),
log(max(enet_default_mod10$results$lambda)),
length.out = 25))
enet_grid_10 <- expand.grid(alpha = seq(0.1, 1.0, by = 0.1),
lambda = my_lambda_grid_10)
enet_grid_10 %>% dim()
## [1] 250 2
Now train, assess, and tune the elastic net model with the custom tuning grid and assign the result to the enet_tune_10 object.
set.seed(1234)
enet_tune_10 <- train(y ~ (ns(R, df = 3) + ns(G, df = 3) + ns(B, df = 3)) * Lightness,
data = train_dataset,
method = 'glmnet',
metric = my_metric,
tuneGrid = enet_grid_10,
preProcess = c("center", "scale"),
trControl = my_ctrl)
plot(enet_tune_10, xTrans = log)
enet_tune_10$bestTune
## alpha lambda
## 1 0.1 0.001924909
The optimal tuning parameter values are printed above. The best alpha is 0.1.
Interpretation: from the sparse coefficient matrix of the tuned model, we can observe the coefficients for the various predictor variables and their interactions with the response variable.
Positively Influencing Features:
Negatively Influencing Features:
enet_tune_10 %>% readr::write_rds("enet_tune_10.rds")
Training a neural network to predict the response with respect to all inputs. Since the neural network will learn non-linear relationships on its own, I am not specifying interactions between the inputs.
The dot operator streamlines the formula interface when the dataframe ONLY consists of inputs and the output.
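A toy sketch on the built-in mtcars data (not the project data) shows why: y ~ . expands to every remaining column of the dataframe, so it only behaves as intended when the dataframe holds nothing but the inputs and the output.

```r
# Toy sketch on the built-in mtcars data: the dot operator expands to all
# columns other than the response, so the two fits below are identical
fit_dot  <- lm(mpg ~ ., data = mtcars)
fit_full <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am +
                 gear + carb, data = mtcars)

all.equal(coef(fit_dot), coef(fit_full))  # TRUE
```

This is the motivation for building df_caret below with only the six inputs and y.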
library(dplyr)
df_caret <- train_dataset %>%
select(R, G, B, Hue, Lightness, Saturation, y)
glimpse(df_caret)
## Rows: 835
## Columns: 7
## $ R <dbl> 172, 26, 172, 28, 170, 175, 90, 194, 171, 122, 0, 88, 144, …
## $ G <dbl> 58, 88, 94, 87, 66, 89, 78, 106, 68, 151, 121, 140, 82, 163…
## $ B <dbl> 62, 151, 58, 152, 58, 65, 136, 53, 107, 59, 88, 58, 132, 50…
## $ Hue <dbl> 4, 31, 8, 32, 5, 6, 34, 10, 1, 21, 24, 22, 36, 16, 26, 12, …
## $ Lightness <chr> "dark", "dark", "dark", "dark", "dark", "dark", "dark", "da…
## $ Saturation <chr> "bright", "bright", "bright", "bright", "bright", "bright",…
## $ y <dbl> -1.9924302, -2.1972246, -1.6582281, -2.1972246, -2.0907411,…
Train, assess, and tune the nnet neural network with the defined resampling scheme. Assign the result to the nnet_default object and print the result to the screen.
my_ctrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 3, savePredictions = 'final')
my_metric <- "RMSE"
set.seed(1234)
nnet_default <- train( y ~ R + G + B + Hue + Saturation + Lightness,
data = df_caret,
method = 'nnet',
metric = my_metric,
preProcess = c("center", "scale"),
trControl = my_ctrl,
trace = FALSE)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
nnet_default
## Neural Network
##
## 835 samples
## 6 predictor
##
## Pre-processing: centered (16), scaled (16)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 753, 752, 751, 750, 750, 752, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 1 0e+00 1.1880204 NaN 1.0114462
## 1 1e-04 1.1100390 0.7300983 0.8893367
## 1 1e-01 0.9701749 0.7450616 0.6615991
## 3 0e+00 1.1880204 NaN 1.0114462
## 3 1e-04 1.0592522 0.6257014 0.8099094
## 3 1e-01 0.9684767 0.7512249 0.6577229
## 5 0e+00 1.1880204 NaN 1.0114462
## 5 1e-04 1.0374433 0.6680409 0.7707822
## 5 1e-01 0.9680629 0.7526684 0.6569813
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5 and decay = 0.1.
# Access the resampling results
resampling_results <- nnet_default$resample
# Extract Rsquared values
rmse_values <- resampling_results$RMSE
# Calculate mean Rsquared value as a proxy for accuracy
rmse_nnd <- mean(rmse_values, na.rm = TRUE)
# Print the accuracy
print(rmse_nnd)
## [1] 0.9680629
As shown by the above display, the best neural network has size = 5 and decay = 0.1. The best neural network therefore consists of 5 hidden units. This is not a large neural network, so we should consider trying more hidden units to see if the results can be improved further.
plot(nnet_default, xTrans=log)
The code chunk below defines a custom tuning grid focused on tuning the decay parameter. The decay parameter is just the regularization strength and thus is similar to lambda used in elastic net with the ridge penalty!
nnet_grid <- expand.grid(size = c(5, 10, 20),
decay = exp(seq(-6, 2, length.out=11)))
nnet_grid %>% dim()
## [1] 33 2
The more refined neural network tuning grid is used below.
set.seed(1234)
nnet_tune <- train( y ~ R + G + B + Hue + Saturation + Lightness,
data = df_caret,
method = 'nnet',
metric = my_metric,
tuneGrid = nnet_grid,
preProcess = c("center", "scale"),
trControl = my_ctrl,
trace = FALSE)
nnet_tune$bestTune
## size decay
## 25 20 0.01227734
# Access the resampling results
resampling_results <- nnet_tune$resample
# Extract RMSE values
rmse_values <- resampling_results$RMSE
# Calculate mean RMSE value as a proxy for accuracy
rmse_nnt <- mean(rmse_values, na.rm = TRUE)
# Print the accuracy
print(rmse_nnt)
## [1] 0.9664263
The tuning results for the refined, larger neural networks are visualized below. By default caret identifies whichever model has the overall best performance. The plot below reveals there is relatively little difference between the smaller networks with 5 or 10 hidden units and the overall best with 20 hidden units.
plot(nnet_tune, xTrans=log)
The mean resampled RMSE of the tuned model (0.9664) is only marginally lower than that of the default model (0.9681), so the refined grid did not substantially improve performance.
The pred_viz_nnet_probs vector is column-bound to the viz_grid dataframe. The new object, viz_nnet_df, provides the model's predicted response for each input combination in the visualization grid.
viz_grid %>% glimpse
## Rows: 784
## Columns: 6
## $ R <int> 47, 216, 134, 205, 55, 51, 76, 144, 27, 254, 18, 233, 156, …
## $ G <int> 193, 172, 58, 125, 32, 94, 50, 106, 38, 61, 168, 184, 255, …
## $ B <int> 179, 34, 173, 210, 103, 38, 134, 42, 100, 253, 177, 243, 2,…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 2, 23, 1, 27, 13, 31, 7, 15, 26, 21, 12, 5, 15, 16, 30, 21,…
pred_viz_nnet_probs <- predict( nnet_default, newdata = viz_grid)
viz_nnet_df <- viz_grid %>% bind_cols(pred_viz_nnet_probs)
## New names:
## • `` -> `...7`
viz_nnet_df %>% glimpse()
## Rows: 784
## Columns: 7
## $ R <int> 47, 216, 134, 205, 55, 51, 76, 144, 27, 254, 18, 233, 156, …
## $ G <int> 193, 172, 58, 125, 32, 94, 50, 106, 38, 61, 168, 184, 255, …
## $ B <int> 179, 34, 173, 210, 103, 38, 134, 42, 100, 253, 177, 243, 2,…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 2, 23, 1, 27, 13, 31, 7, 15, 26, 21, 12, 5, 15, 16, 30, 21,…
## $ ...7 <dbl> 0.0033921016, 0.0046741858, 0.0002252418, 0.0014585146, 0.0…
The glimpse reveals that the unnamed `...7` column stores the neural network's predictions. Next, visualize them in a manner consistent with the viz_bayes_logpost_preds() visualizations and the tuned elastic net predictions shown earlier: the predicted response with respect to G, colored by R, for binned ranges of B.
# Define custom breaks for B
breaks <- c(0, 50, 100, 150, 200, 255)
# Create labels for the breaks
labels <- c("0-50", "51-100", "101-150", "151-200", "201-255")
# Add breaks and labels to the B variable
viz_nnet_df$B_range <- cut(viz_nnet_df$B, breaks = breaks, labels = labels, include.lowest = TRUE)
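As a quick aside on the binning above: `cut()` uses left-open intervals, so `include.lowest = TRUE` is needed to keep the value 0 in the first bin. A standalone check with made-up values:

```r
# cut() builds intervals (0,50], (50,100], ...; include.lowest = TRUE
# closes the first interval to [0,50] so that 0 is not dropped as NA
x <- c(0, 50, 51, 255)
bins <- cut(x,
            breaks = c(0, 50, 100, 150, 200, 255),
            labels = c("0-50", "51-100", "101-150", "151-200", "201-255"),
            include.lowest = TRUE)
as.character(bins)  # "0-50" "0-50" "51-100" "201-255"
```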
viz_nnet_df %>%
ggplot(mapping = aes(x = G, y = .data[["...7"]])) +
geom_point(mapping = aes(color = R), size = 2.0) +
facet_wrap(~B_range, labeller = 'label_both') +
theme_bw()
We explored the RGB and HSL color models to understand how color composition relates to the response y. By visually inspecting the model predictions, we aimed to identify color combinations in the RGB model that are associated with the response. At the project's outset it was unclear how the response depended on the inputs, so this exploration was meant to reveal which inputs influence the response positively or negatively.
viz_nnet_df %>%
ggplot(mapping = aes(x = Hue, y = .data[["...7"]])) +
geom_point(mapping = aes(color = Lightness), size = 2.0) +
facet_wrap(~Saturation, labeller = 'label_both') +
geom_density_2d() +
theme_bw()
The HSL graph suggests that most Saturation categories contribute little to higher predicted responses, indicating that saturation has minimal influence in driving the response upward.
Let's now try a tree-based method, using the default tuning grid. Tree-based models do not have the same preprocessing requirements as other models. Train a random forest model by setting the method argument equal to "rf". You must set importance = TRUE in the caret::train() function call. Assign the result to the rf_default variable and display the rf_default object to the screen.
The random forest model is trained below with caret's default tuning grid. The formula interface uses the . operator to select all inputs in the df_caret dataframe.
set.seed(1234)
rf_default <- train( y ~ .,
data = df_caret,
method = 'rf',
metric = my_metric,
trControl = my_ctrl,
importance = TRUE)
rf_default
## Random Forest
##
## 835 samples
## 6 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 753, 752, 751, 750, 750, 752, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.22496164 0.9766607 0.16509671
## 9 0.06905697 0.9967770 0.05096918
## 16 0.08489352 0.9949528 0.06158817
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 9.
# Access the resampling results
resampling_results <- rf_default$resample
# Extract RMSE values
rmse_values <- resampling_results$RMSE
# Average the resampled RMSE for the selected model
rmse_rfd <- mean(rmse_values, na.rm = TRUE)
# Print the mean RMSE
print(rmse_rfd)
## [1] 0.06905697
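For reference, the RMSE that caret averages over the held-out folds is just the root mean squared error between observed and predicted responses; the modelr package loaded earlier computes the same quantity from a fitted model. A minimal base-R sketch on made-up vectors:

```r
# RMSE and MAE from scratch on hypothetical observed/predicted values
y_obs  <- c(1.0, 2.0, 3.0, 4.0)
y_pred <- c(1.1, 1.9, 3.2, 3.8)

rmse <- sqrt(mean((y_obs - y_pred)^2))  # penalizes large errors more
mae  <- mean(abs(y_obs - y_pred))       # average absolute error

c(RMSE = rmse, MAE = mae)
```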
Let's make predictions on the visualization grid, viz_grid, using the random forest model rf_default. Because this is a regression model, predict() returns the predicted response directly; the type = 'prob' argument applies only to classification.
pred_viz_rf_probs <- predict( rf_default, newdata = viz_grid)
The pred_viz_rf_probs vector is column-bound to the viz_grid dataframe. The new object, viz_rf_df, provides the random forest's predicted response for each input combination in the visualization grid.
viz_rf_df <- viz_grid %>% bind_cols(pred_viz_rf_probs)
## New names:
## • `` -> `...7`
viz_rf_df %>% glimpse()
## Rows: 784
## Columns: 7
## $ R <int> 47, 216, 134, 205, 55, 51, 76, 144, 27, 254, 18, 233, 156, …
## $ G <int> 193, 172, 58, 125, 32, 94, 50, 106, 38, 61, 168, 184, 255, …
## $ B <int> 179, 34, 173, 210, 103, 38, 134, 42, 100, 253, 177, 243, 2,…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 2, 23, 1, 27, 13, 31, 7, 15, 26, 21, 12, 5, 15, 16, 30, 21,…
## $ ...7 <dbl> -0.1744327, -0.3734658, -1.8824845, -0.7304833, -2.5461860,…
The glimpse reveals that the unnamed `...7` column stores the random forest's predicted response.
# Define custom breaks for the color channels
breaks <- c(0, 50, 100, 150, 200, 255)
# Create labels for the breaks
labels <- c("0-50", "51-100", "101-150", "151-200", "201-255")
# Bin R, G, and B into ranges
viz_rf_df$R_range <- cut(viz_rf_df$R, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_rf_df$G_range <- cut(viz_rf_df$G, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_rf_df$B_range <- cut(viz_rf_df$B, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_rf_df %>%
ggplot(mapping = aes(x = G, y = .data[["...7"]])) +
geom_point(mapping = aes(color = R),
size = 1.0) +
facet_wrap(~B_range, labeller = 'label_both') +
geom_density_2d() +
theme_bw()
viz_rf_df %>%
ggplot(mapping = aes(x = Hue, y = .data[["...7"]])) +
geom_point(mapping = aes(color = Lightness), size = 1.0) +
facet_wrap(~Saturation, labeller = 'label_both') +
geom_density_2d() +
theme_bw()
The predictions generated by the random forest model fluctuate abruptly rather than varying smoothly. This is expected for a tree-based method: decision trees partition the input space and predict a constant value within each region, so they cannot produce smooth prediction surfaces.
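To see why tree predictions are step-like, consider a minimal one-split regression "stump" on made-up 1-D data (an illustration only, not the caret workflow): each leaf predicts the mean response of its region, so the fitted function is piecewise constant.

```r
# One-split regression stump: pick the split minimizing the total
# within-leaf sum of squares, then predict the leaf mean on each side.
set.seed(1)
x <- sort(runif(50, 0, 10))
y <- sin(x) + rnorm(50, sd = 0.1)

sse <- function(v) sum((v - mean(v))^2)
candidates <- x[-1]  # possible split points between consecutive x values
cost <- sapply(candidates, function(s) sse(y[x < s]) + sse(y[x >= s]))
split <- candidates[which.min(cost)]

pred <- ifelse(x < split, mean(y[x < split]), mean(y[x >= split]))
length(unique(pred))  # only 2 distinct predicted values: a step function
```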
# Define a grid of hyperparameters for illustration. Note that caret's 'rf'
# method only exposes mtry for tuning, so the nodesize values here cannot be
# passed through tuneGrid; the tuning below therefore varies mtry only.
rf_grid <- expand.grid(
mtry = c(2, 4, 6),
nodesize = c(5, 10, 20)
)
Training the random forest again:
# Determine the number of predictor variables
num_predictors <- ncol(df_caret) - 1
# Train the random forest model
rf_tune <- train(y ~ R + G + B + Hue + Saturation + Lightness,
data = df_caret,
method = 'rf',
metric = my_metric,
tuneGrid = expand.grid(.mtry = seq(1, num_predictors)),
preProcess = c("center", "scale"),
trControl = my_ctrl,
trace = FALSE)
# Print the best hyperparameters
rf_tune
## Random Forest
##
## 835 samples
## 6 predictor
##
## Pre-processing: centered (16), scaled (16)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 752, 750, 751, 753, 751, 752, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 0.55440443 0.9328139 0.42841163
## 2 0.22363474 0.9769102 0.16457692
## 3 0.12896439 0.9903788 0.09496159
## 4 0.09513319 0.9944673 0.06981387
## 5 0.08007561 0.9959461 0.05913711
## 6 0.07293949 0.9965471 0.05410188
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
The resampled RMSE for the tuned random forest is 0.0729.
# Access the resampling results
resampling_results <- rf_tune$resample
# Extract RMSE values
rmse_values <- resampling_results$RMSE
# Average the resampled RMSE for the selected model
rmse_rft <- mean(rmse_values, na.rm = TRUE)
# Print the mean RMSE
print(rmse_rft)
## [1] 0.07293949
Based on the resampled RMSE, the tuned random forest (0.0729 at mtry = 6) actually performs slightly worse than the default grid's best model (0.0691 at mtry = 9): restricting the tuning grid to mtry values of at most 6 excluded the best value found by the default grid.
plot(rf_default, xTrans=log)
plot(rf_tune, xTrans=log)
pred_viz_rft_probs <- predict( rf_tune, newdata = viz_grid )
The pred_viz_rft_probs vector is column-bound to the viz_grid dataframe. The new object, viz_rft_df, provides the tuned random forest's predicted response for each input combination in the visualization grid.
viz_rft_df <- viz_grid %>% bind_cols(pred_viz_rft_probs)
## New names:
## • `` -> `...7`
viz_rft_df %>% glimpse()
## Rows: 784
## Columns: 7
## $ R <int> 47, 216, 134, 205, 55, 51, 76, 144, 27, 254, 18, 233, 156, …
## $ G <int> 193, 172, 58, 125, 32, 94, 50, 106, 38, 61, 168, 184, 255, …
## $ B <int> 179, 34, 173, 210, 103, 38, 134, 42, 100, 253, 177, 243, 2,…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 2, 23, 1, 27, 13, 31, 7, 15, 26, 21, 12, 5, 15, 16, 30, 21,…
## $ ...7 <dbl> -0.2086526, -0.4019054, -1.6662341, -0.5481297, -2.4459292,…
The glimpse reveals that the unnamed `...7` column stores the tuned random forest's predicted response.
# Define custom breaks for the color channels
breaks <- c(0, 50, 100, 150, 200, 255)
# Create labels for the breaks
labels <- c("0-50", "51-100", "101-150", "151-200", "201-255")
# Bin R, G, and B into ranges (note: bin viz_rft_df's own columns, not viz_rf_df's)
viz_rft_df$R_range <- cut(viz_rft_df$R, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_rft_df$G_range <- cut(viz_rft_df$G, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_rft_df$B_range <- cut(viz_rft_df$B, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_rft_df %>%
ggplot(mapping = aes(x = G, y = .data[["...7"]])) +
geom_point(mapping = aes(color = R),
size = 1.0) +
facet_wrap(~B_range, labeller = 'label_both') +
geom_density_2d() +
theme_bw()
The upward slope of the points in the graph indicates a positive correlation between the G (green) value and the response variable for the tuned random forest model. As the G value increases, the response variable tends to increase as well. This suggests that higher levels of green in the color combination are associated with higher values of the response variable. Additionally, the color mapping of the points based on the R (red) value helps identify any patterns or relationships between the red component and the response variable within different ranges of the B (blue) component, even after tuning the random forest model.
viz_rft_df %>%
ggplot(mapping = aes(x = Hue, y = .data[["...7"]])) +
geom_point(mapping = aes(color = Lightness), size = 2.0) +
facet_wrap(~Saturation, labeller = 'label_both') +
geom_density_2d() +
theme_bw()
The Gradient Boosted Tree model is trained and tuned below.
# Set the seed for reproducibility
set.seed(1234)
# Train the Gradient Boosted Trees model
gbm_default <- train(y ~ .,
data = df_caret,
method = 'gbm',
metric = "RMSE",
trControl = my_ctrl,
verbose = FALSE)
# Print the default GBM model
gbm_default$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 9 150 3 0.1 10
# Access the resampling results
resampling_results <- gbm_default$resample
# Extract RMSE values
rmse_values <- resampling_results$RMSE
# Average the resampled RMSE for the selected model
rmse_gbmd <- mean(rmse_values, na.rm = TRUE)
# Print the mean RMSE
print(rmse_gbmd)
## [1] 0.06040106
pred_viz_gbm_probs <- predict( gbm_default, newdata = viz_grid )
viz_gbm_df <- viz_grid %>% bind_cols(pred_viz_gbm_probs)
## New names:
## • `` -> `...7`
viz_gbm_df %>% glimpse()
## Rows: 784
## Columns: 7
## $ R <int> 47, 216, 134, 205, 55, 51, 76, 144, 27, 254, 18, 233, 156, …
## $ G <int> 193, 172, 58, 125, 32, 94, 50, 106, 38, 61, 168, 184, 255, …
## $ B <int> 179, 34, 173, 210, 103, 38, 134, 42, 100, 253, 177, 243, 2,…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 2, 23, 1, 27, 13, 31, 7, 15, 26, 21, 12, 5, 15, 16, 30, 21,…
## $ ...7 <dbl> -0.1347160, -0.3243994, -2.0330345, -0.7341098, -2.6210675,…
# Define custom breaks for the color channels
breaks <- c(0, 50, 100, 150, 200, 255)
# Create labels for the breaks
labels <- c("0-50", "51-100", "101-150", "151-200", "201-255")
# Bin R, G, and B into ranges
viz_gbm_df$R_range <- cut(viz_gbm_df$R, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gbm_df$G_range <- cut(viz_gbm_df$G, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gbm_df$B_range <- cut(viz_gbm_df$B, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gbm_df %>%
ggplot(mapping = aes(x = G, y = .data[["...7"]])) +
geom_point(mapping = aes(color = R),
size = 1.0) +
facet_wrap(~B_range, labeller = 'label_both') +
geom_density_2d() +
theme_bw()
The graph illustrates an upward trend between the G (green) value and the response predicted by the default gradient boosted tree model. As G increases, the predicted response rises correspondingly, indicating a positive association: higher levels of green in the color composition tend to coincide with elevated response values. Moreover, the color mapping based on the R (red) value suggests that lower R values have a more pronounced impact on the response than higher ones.
viz_gbm_df %>%
ggplot(mapping = aes(x = Hue, y = .data[["...7"]])) +
geom_point(mapping = aes(color = Lightness), size = 1.0) +
facet_wrap(~Saturation, labeller = 'label_both') +
geom_density_2d() +
theme_bw()
# Train and tune the Gradient Boosted Trees model directly in the train function
gbm_tune <- train(y ~ .,
data = df_caret,
method = 'gbm',
metric = "RMSE", # Use accuracy as the evaluation metric
trControl = my_ctrl,
tuneGrid = expand.grid(.n.trees = c(100, 200, 300),
.interaction.depth = seq(1, num_predictors),
.shrinkage = c(0.01, 0.1, 0.3),
.n.minobsinnode = c(5, 10, 20)),
verbose = FALSE)
gbm_tune$bestTune
## n.trees interaction.depth shrinkage n.minobsinnode
## 105 300 6 0.1 10
# Access the resampling results
resampling_results <- gbm_tune$resample
# Extract RMSE values
rmse_values <- resampling_results$RMSE
# Average the resampled RMSE for the selected model
rmse_gbmt <- mean(rmse_values, na.rm = TRUE)
# Print the mean RMSE
print(rmse_gbmt)
## [1] 0.05284694
plot(gbm_default, xTrans=log)
plot(gbm_tune, xTrans=log)
pred_viz_gbmt_probs <- predict( gbm_tune, newdata = viz_grid )
The pred_viz_gbmt_probs vector is column-bound to the viz_grid dataframe. The new object, viz_gbmt_df, provides the tuned gradient boosted tree model's predicted response for each input combination in the visualization grid.
viz_gbmt_df <- viz_grid %>% bind_cols(pred_viz_gbmt_probs)
## New names:
## • `` -> `...7`
viz_gbmt_df %>% glimpse()
## Rows: 784
## Columns: 7
## $ R <int> 47, 216, 134, 205, 55, 51, 76, 144, 27, 254, 18, 233, 156, …
## $ G <int> 193, 172, 58, 125, 32, 94, 50, 106, 38, 61, 168, 184, 255, …
## $ B <int> 179, 34, 173, 210, 103, 38, 134, 42, 100, 253, 177, 243, 2,…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, deep, deep,…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 2, 23, 1, 27, 13, 31, 7, 15, 26, 21, 12, 5, 15, 16, 30, 21,…
## $ ...7 <dbl> -0.2370389, -0.3286764, -1.9625584, -0.6761660, -2.7055928,…
The glimpse reveals that the unnamed `...7` column stores the tuned gradient boosted tree model's predicted response.
# Define custom breaks for the color channels
breaks <- c(0, 50, 100, 150, 200, 255)
# Create labels for the breaks
labels <- c("0-50", "51-100", "101-150", "151-200", "201-255")
# Bin R, G, and B into ranges
viz_gbmt_df$R_range <- cut(viz_gbmt_df$R, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gbmt_df$G_range <- cut(viz_gbmt_df$G, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gbmt_df$B_range <- cut(viz_gbmt_df$B, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gbmt_df %>%
ggplot(mapping = aes(x = G, y = .data[["...7"]])) +
geom_point(mapping = aes(color = R),
size = 1.0) +
facet_wrap(~B_range, labeller = 'label_both') +
geom_density_2d() +
theme_bw()
viz_gbmt_df %>%
ggplot(mapping = aes(x = Hue, y = .data[["...7"]])) +
geom_point(mapping = aes(color = Lightness), size = 1.0) +
facet_wrap(~Saturation, labeller = 'label_both') +
geom_density_2d() +
theme_bw()
# Set seed for reproducibility
set.seed(1234)
gam_model <- train(
y ~ .,
data = df_caret,
method = "gam",
metric = "RMSE",
trControl = my_ctrl,
tuneGrid = expand.grid(
select = c(0.1, 0.2, 0.3), # candidate values for mgcv's 'select' penalty flag
method = "GCV.Cp" # smoothness selection criterion
)
)
## Loading required package: mgcv
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
# Print the best hyperparameters
gam_model$bestTune
## select method
## 1 0.1 GCV.Cp
The output indicates that the tuning process selected:
select: 0.1, method: GCV.Cp. Here's what each of these means:
select: in mgcv, select is a logical flag rather than a numeric smoothing parameter. When TRUE, an extra penalty is added to each smooth term so that entire terms can be shrunk to zero during smoothness selection. Any non-zero numeric value is coerced to TRUE, so the candidate values 0.1, 0.2, and 0.3 all request the same model, which is why the resampling results are identical for every value tried. method: this specifies the smoothness selection criterion. Common options include generalized cross-validation (GCV) and Mallows' Cp; the GCV.Cp setting uses GCV when the scale parameter is unknown and Cp when it is known.
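The coercion behind the identical tuning results can be checked directly in base R (standalone; the values match the tuning grids used here):

```r
# mgcv treats 'select' as a logical flag; any non-zero numeric is TRUE,
# so these candidate values all request exactly the same model.
vals <- c(0.1, 0.2, 0.3, 3, 6)
as.logical(vals)  # TRUE TRUE TRUE TRUE TRUE
as.logical(0)     # FALSE -- the only way to turn the extra penalty off
```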
# Access the resampling results
resampling_results <- gam_model$resample
# Extract RMSE values
rmse_values <- resampling_results$RMSE
# Average the resampled RMSE for the selected model
rmse_gamd <- mean(rmse_values, na.rm = TRUE)
# Print the mean RMSE
print(rmse_gamd)
## [1] 0.05680794
pred_viz_gam_probs <- predict( gam_model, newdata = viz_grid )
viz_gam_df <- cbind(viz_grid, pred_viz_gam_probs)
viz_gam_df %>% glimpse()
## Rows: 784
## Columns: 7
## $ R <int> 47, 216, 134, 205, 55, 51, 76, 144, 27, 254, 18, 23…
## $ G <int> 193, 172, 58, 125, 32, 94, 50, 106, 38, 61, 168, 18…
## $ B <int> 179, 34, 173, 210, 103, 38, 134, 42, 100, 253, 177,…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, dee…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 2, 23, 1, 27, 13, 31, 7, 15, 26, 21, 12, 5, 15, 16,…
## $ pred_viz_gam_probs <dbl> -0.4728675, -0.3402283, -2.1147172, -0.6977772, -2.…
The glimpse reveals that the pred_viz_gam_probs column stores the GAM's predicted response. Next, visualize the predictions in the same manner as the earlier models: the predicted response with respect to G, colored by R, for binned ranges of B.
# Define custom breaks for the color channels
breaks <- c(0, 50, 100, 150, 200, 255)
# Create labels for the breaks
labels <- c("0-50", "51-100", "101-150", "151-200", "201-255")
# Bin R, G, and B into ranges
viz_gam_df$R_range <- cut(viz_gam_df$R, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gam_df$G_range <- cut(viz_gam_df$G, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gam_df$B_range <- cut(viz_gam_df$B, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gam_df %>%
ggplot(mapping = aes(x = G, y = pred_viz_gam_probs)) +
geom_point(mapping = aes(color = R), size = 1.0) +
facet_wrap(~B_range, labeller = 'label_both') +
geom_density_2d() + # Add contour lines to represent point density
theme_bw()
viz_gam_df %>%
ggplot(mapping = aes(x = Hue, y = pred_viz_gam_probs)) +
geom_point(mapping = aes(color = Lightness), size = 2.0) +
facet_wrap(~Saturation, labeller = 'label_both') +
geom_density_2d() +
theme_bw()
Tuning Generalized Additive Models (GAM):
tuning_results <- gam_model$results
print(tuning_results)
## select method RMSE Rsquared MAE RMSESD RsquaredSD
## 1 0.1 GCV.Cp 0.05680794 0.9977628 0.04028667 0.006941342 0.0005118255
## 2 0.2 GCV.Cp 0.05680794 0.9977628 0.04028667 0.006941342 0.0005118255
## 3 0.3 GCV.Cp 0.05680794 0.9977628 0.04028667 0.006941342 0.0005118255
## MAESD
## 1 0.003681035
## 2 0.003681035
## 3 0.003681035
# Set seed for reproducibility
set.seed(1234)
gam_tune <- train(
y ~ .,
data = df_caret,
method = "gam",
metric = "RMSE",
trControl = my_ctrl,
tuneGrid = expand.grid(
select = c(0.3, 3, 6), # candidate 'select' values (coerced to logical by mgcv)
method = "GCV.Cp" # smoothness selection criterion
)
)
# Print the tuning results
gam_tune$results
## select method RMSE Rsquared MAE RMSESD RsquaredSD
## 1 0.3 GCV.Cp 0.05680794 0.9977628 0.04028667 0.006941342 0.0005118255
## 2 3.0 GCV.Cp 0.05680794 0.9977628 0.04028667 0.006941342 0.0005118255
## 3 6.0 GCV.Cp 0.05680794 0.9977628 0.04028667 0.006941342 0.0005118255
## MAESD
## 1 0.003681035
## 2 0.003681035
## 3 0.003681035
# Access the resampling results
resampling_results <- gam_tune$resample
# Extract RMSE values
rmse_values <- resampling_results$RMSE
# Average the resampled RMSE for the selected model
rmse_gamt <- mean(rmse_values, na.rm = TRUE)
# Print the mean RMSE
print(rmse_gamt)
## [1] 0.05680794
pred_viz_gamt_probs <- predict( gam_tune, newdata = viz_grid )
viz_gamt_df <- cbind(viz_grid, pred_viz_gamt_probs)
viz_gamt_df %>% glimpse()
## Rows: 784
## Columns: 7
## $ R <int> 47, 216, 134, 205, 55, 51, 76, 144, 27, 254, 18, 2…
## $ G <int> 193, 172, 58, 125, 32, 94, 50, 106, 38, 61, 168, 1…
## $ B <int> 179, 34, 173, 210, 103, 38, 134, 42, 100, 253, 177…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, de…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, br…
## $ Hue <int> 2, 23, 1, 27, 13, 31, 7, 15, 26, 21, 12, 5, 15, 16…
## $ pred_viz_gamt_probs <dbl> -0.4728675, -0.3402283, -2.1147172, -0.6977772, -2…
# Define custom breaks for the color channels
breaks <- c(0, 50, 100, 150, 200, 255)
# Create labels for the breaks
labels <- c("0-50", "51-100", "101-150", "151-200", "201-255")
# Bin R, G, and B into ranges
viz_gamt_df$R_range <- cut(viz_gamt_df$R, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gamt_df$G_range <- cut(viz_gamt_df$G, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gamt_df$B_range <- cut(viz_gamt_df$B, breaks = breaks, labels = labels, include.lowest = TRUE)
viz_gamt_df %>%
ggplot(mapping = aes(x = G, y = pred_viz_gamt_probs)) +
geom_point(mapping = aes(color = R), size = 1.0) +
facet_wrap(~B_range, labeller = 'label_both') +
geom_density_2d() + # Add contour lines to represent point density
theme_bw()
viz_gamt_df %>%
ggplot(mapping = aes(x = Hue, y = pred_viz_gamt_probs)) +
geom_point(mapping = aes(color = Lightness), size = 2.0) +
facet_wrap(~Saturation, labeller = 'label_both') +
geom_density_2d() +
theme_bw()
# KNN
K-Nearest Neighbors (KNN) is a simple yet effective algorithm for both classification and regression tasks. It identifies the K training points nearest to a query point and predicts the most common class (classification) or the average response (regression) among those neighbors.
We now train the KNN model using the train function from the caret package, specifying the method as "knn". For this regression problem caret uses RMSE as the evaluation metric by default.
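The neighbor-averaging idea can be sketched in a few lines of base R (an illustration under simplifying assumptions: Euclidean distance on numeric inputs, no scaling; not caret's implementation):

```r
# Minimal KNN regression: predict a query point as the mean response of
# its k nearest training points by Euclidean distance.
knn_predict <- function(X_train, y_train, x_query, k = 5) {
  d <- sqrt(rowSums(sweep(X_train, 2, x_query)^2))  # distance to each row
  neighbors <- order(d)[seq_len(k)]                 # indices of k nearest
  mean(y_train[neighbors])
}

# tiny made-up example: the response equals the first input column
X <- cbind(x1 = c(1, 2, 3, 10, 11, 12), x2 = 0)
y <- X[, "x1"]
knn_predict(X, y, x_query = c(2, 0), k = 3)  # mean of y at x1 = 1, 2, 3 -> 2
```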
set.seed(1234)
knn_default <- train(y ~ .,
data = df_caret,
method = "knn",
trControl = my_ctrl,
tuneLength = 10) # Try 10 candidate values of K
knn_default
## k-Nearest Neighbors
##
## 835 samples
## 6 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 753, 752, 751, 750, 750, 752, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 0.09085295 0.9941956 0.06814833
## 7 0.09552074 0.9935983 0.06996092
## 9 0.09962224 0.9930040 0.07244288
## 11 0.10437028 0.9923264 0.07591964
## 13 0.10716197 0.9919102 0.07815402
## 15 0.11029943 0.9914546 0.08047162
## 17 0.11387491 0.9908989 0.08293891
## 19 0.11785148 0.9902752 0.08580610
## 21 0.12214235 0.9895572 0.08923681
## 23 0.12608584 0.9889255 0.09212414
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
From the above results we can see that KNN was tuned over K values from 5 to 23 on the 835-sample, 6-predictor dataset using repeated 10-fold cross-validation. The optimal model used k = 5, achieving a resampled RMSE of about 0.091 (R-squared of about 0.994). Averaging over the 5 nearest neighbors therefore gives the most accurate predictions of the response; KNN's simplicity and effectiveness make it a useful baseline, especially for relatively small datasets. #### RMSE
# Access the resampling results
resampling_results <- knn_default$resample
# Extract RMSE values
rmse_values <- resampling_results$RMSE
# Average the resampled RMSE for the selected model
rmse_knnd <- mean(rmse_values, na.rm = TRUE)
# Print the mean RMSE
print(rmse_knnd)
## [1] 0.09085295
Predictions to understand the behavior of the model
# Predictions to understand the behavior of the default KNN model
pred_viz_knn_probs <- predict(knn_default, newdata = viz_grid)
# Combine predictions with visualization grid
viz_knn_df <- cbind(viz_grid, pred_viz_knn_probs)
# Display a glimpse of the combined dataframe
viz_knn_df %>% glimpse()
## Rows: 784
## Columns: 7
## $ R <int> 47, 216, 134, 205, 55, 51, 76, 144, 27, 254, 18, 23…
## $ G <int> 193, 172, 58, 125, 32, 94, 50, 106, 38, 61, 168, 18…
## $ B <int> 179, 34, 173, 210, 103, 38, 134, 42, 100, 253, 177,…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark, dee…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright, bri…
## $ Hue <int> 2, 23, 1, 27, 13, 31, 7, 15, 26, 21, 12, 5, 15, 16,…
## $ pred_viz_knn_probs <dbl> -0.710997114, -0.371267000, -1.719681847, -0.644310…
Plot for knn_default
Let's analyze it with a graph:
plot(knn_default,main="KNN Default Plot", xTrans=log)
The plot shows the relationship between the number of neighbors K (on a log-transformed x-axis) and the resampled RMSE of the KNN model. RMSE is lowest at k = 5 and rises steadily as K increases: averaging over more neighbors oversmooths the predictions, while the k = 5 model balances local flexibility against noise. This reflects the usual bias-variance trade-off controlled by K.
Tuning KNN
# Define a tuning grid for KNN
knn_grid <- expand.grid(k = seq(1, 20, by = 2)) # Define the range of K values
set.seed(1234)
knn_tune <- train(y ~ .,
data = df_caret,
method = "knn",
trControl = my_ctrl,
tuneGrid = knn_grid)
knn_tune
## k-Nearest Neighbors
##
## 835 samples
## 6 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 753, 752, 751, 750, 750, 752, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 0.12302331 0.9894370 0.09078516
## 3 0.09705306 0.9933391 0.07262540
## 5 0.09085295 0.9941956 0.06814833
## 7 0.09552074 0.9935983 0.06996092
## 9 0.09962224 0.9930040 0.07244288
## 11 0.10437028 0.9923264 0.07591964
## 13 0.10716197 0.9919102 0.07815402
## 15 0.11029943 0.9914546 0.08047162
## 17 0.11387491 0.9908989 0.08293891
## 19 0.11785148 0.9902752 0.08580610
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
# Access the resampling results
resampling_results <- knn_tune$resample
# Extract RMSE values
rmse_values <- resampling_results$RMSE
# Average the resampled RMSE for the selected model
rmse_knnt <- mean(rmse_values, na.rm = TRUE)
# Print the mean RMSE
print(rmse_knnt)
## [1] 0.09085295
Predictions to understand the behavior of the tuned knn model
# Predictions to understand the behavior of the tuned KNN model
pred_viz_knn_tune_probs <- predict(knn_tune, newdata = viz_grid)
# Combine predictions with visualization grid
viz_knn_tune_df <- cbind(viz_grid, pred_viz_knn_tune_probs)
# Display a glimpse of the combined dataframe
viz_knn_tune_df %>% glimpse()
## Rows: 784
## Columns: 7
## $ R <int> 47, 216, 134, 205, 55, 51, 76, 144, 27, 254, 1…
## $ G <int> 193, 172, 58, 125, 32, 94, 50, 106, 38, 61, 16…
## $ B <int> 179, 34, 173, 210, 103, 38, 134, 42, 100, 253,…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright…
## $ Hue <int> 2, 23, 1, 27, 13, 31, 7, 15, 26, 21, 12, 5, 15…
## $ pred_viz_knn_tune_probs <dbl> -0.710997114, -0.371267000, -1.719681847, -0.6…
Plotting the tuned model
# Plot the performance of the tuned KNN model
plot(knn_tune, main="KNN Tuned Plot", xTrans=log)
Predictions
# Predictions for the default KNN model
pred_viz_knn_probs_default <- predict(knn_default, newdata = viz_grid)
# Combine predictions with the visualization grid
viz_knn_df_default <- bind_cols(viz_grid, as.data.frame(pred_viz_knn_probs_default))
# Glimpse the combined dataframe
glimpse(viz_knn_df_default)
## Rows: 784
## Columns: 7
## $ R <int> 47, 216, 134, 205, 55, 51, 76, 144, 27, 254…
## $ G <int> 193, 172, 58, 125, 32, 94, 50, 106, 38, 61,…
## $ B <int> 179, 34, 173, 210, 103, 38, 134, 42, 100, 2…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, d…
## $ Saturation <fct> bright, bright, bright, bright, bright, bri…
## $ Hue <int> 2, 23, 1, 27, 13, 31, 7, 15, 26, 21, 12, 5,…
## $ pred_viz_knn_probs_default <dbl> -0.710997114, -0.371267000, -1.719681847, -…
# Predictions for the tuned KNN model
pred_viz_knn_probs_tune <- predict(knn_tune, newdata = viz_grid)
# Combine predictions with the visualization grid
viz_knn_df_tune <- bind_cols(viz_grid, as.data.frame(pred_viz_knn_probs_tune))
# Glimpse the combined dataframe
glimpse(viz_knn_df_tune)
## Rows: 784
## Columns: 7
## $ R <int> 47, 216, 134, 205, 55, 51, 76, 144, 27, 254, 1…
## $ G <int> 193, 172, 58, 125, 32, 94, 50, 106, 38, 61, 16…
## $ B <int> 179, 34, 173, 210, 103, 38, 134, 42, 100, 253,…
## $ Lightness <fct> dark, dark, dark, dark, dark, dark, dark, dark…
## $ Saturation <fct> bright, bright, bright, bright, bright, bright…
## $ Hue <int> 2, 23, 1, 27, 13, 31, 7, 15, 26, 21, 12, 5, 15…
## $ pred_viz_knn_probs_tune <dbl> -0.710997114, -0.371267000, -1.719681847, -0.6…
Visualization
viz_knn_df_default %>%
ggplot(mapping = aes(x = Hue, y = pred_viz_knn_probs_default, color=B)) +
geom_point(size = 1) +
facet_wrap(~Lightness, labeller = 'label_both') +
scale_color_gradient(low = "#00FFFF", high = "#104E8B") +
theme_bw()
Predicted response from the tuned KNN model
viz_knn_df_tune %>%
ggplot(mapping = aes(x = Hue, y = pred_viz_knn_probs_tune, color = B)) +
geom_point(size = 1) +
facet_wrap(~Lightness, labeller = 'label_both') +
scale_color_gradient(low = "#00FFFF", high = "#104E8B") +
theme_bw()
For most of the models above: 1. RMSE as the evaluation metric: RMSE measures the typical prediction error in the units of the response and penalizes large errors more heavily than small ones. It is easy to interpret and is caret's default metric for regression tasks. 2. Repeated k-fold cross-validation as the resampling scheme: the data are split into k folds, each fold serves once as the held-out set, and the whole procedure is repeated with different random splits.
By combining RMSE with repeated k-fold cross-validation, we ensure that each model is evaluated on held-out data it did not see during fitting, and that its performance estimate is averaged over many train-test splits, reducing the risk of an overly optimistic or pessimistic assessment.
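The resampling scheme described above can be sketched by hand. The following is a simplified, single-repeat version of what trainControl's repeated cross-validation does internally, using lm on simulated data:

```r
# Manual 10-fold cross-validation with RMSE on made-up data
set.seed(123)
n <- 100
d <- data.frame(x = runif(n))
d$y <- 2 * d$x + rnorm(n, sd = 0.1)

k <- 10
folds <- sample(rep(seq_len(k), length.out = n))  # random fold labels

fold_rmse <- sapply(seq_len(k), function(i) {
  fit  <- lm(y ~ x, data = d[folds != i, ])        # train on k-1 folds
  pred <- predict(fit, newdata = d[folds == i, ])  # score the held-out fold
  sqrt(mean((d$y[folds == i] - pred)^2))
})
mean(fold_rmse)  # cross-validated RMSE estimate, close to the noise sd
```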
library(ggplot2)
# Create a data frame with model names and resampled RMSE values
# (tuned neural network and random forest; default GBM, GAM, and KNN)
model_rmse <- data.frame(Model = c("Neural Network", "Random Forest", "GBM", "GAM", "KNN"),
RMSE = c(rmse_nnt, rmse_rft, rmse_gbmd, rmse_gamd, rmse_knnd))
# Plot the bar graph
ggplot(model_rmse, aes(x = Model, y = RMSE, fill = Model)) +
geom_bar(stat = "identity") +
labs(title = "RMSE Values for Different Models",
x = "Model",
y = "RMSE") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
##### The best model among those compared according to RMSE is the GAM with 0.0568. (Note that the tuned GBM achieved a lower resampled RMSE of 0.0528 but was not included in the comparison above.)
gam_tune
#### Saving the best performing model according to RMSE for the next parts.
gam_tune %>% readr::write_rds("gam_tune.rds")